HOW CLOSE IS THE SAMPLE COVARIANCE MATRIX TO 
THE ACTUAL COVARIANCE MATRIX? 



ROMAN VERSHYNIN 

Abstract. Given a probability distribution in $\mathbb{R}^n$ with general
(non-white) covariance, a classical estimator of the covariance matrix is the
sample covariance matrix obtained from a sample of $N$ independent points. What
is the optimal sample size $N = N(n)$ that guarantees estimation with a fixed
accuracy in the operator norm? Suppose the distribution is supported in a
centered Euclidean ball of radius $O(\sqrt{n})$. We conjecture that the optimal
sample size is $N = O(n)$ for all distributions with finite fourth moment,
and we prove this up to an iterated logarithmic factor. This problem is
motivated by the optimal theorem of M. Rudelson [23] which states that
$N = O(n \log n)$ for distributions with finite second moment, and a recent
result of R. Adamczak et al. [1] which guarantees that $N = O(n)$ for
sub-exponential distributions.



1. Introduction 

1.1. Approximation problem for covariance matrices. Estimation of covariance
matrices of high dimensional distributions is a basic problem in multivariate
statistics. It arises in diverse applications such as signal processing
[H], genomics [25], financial mathematics [16], pattern recognition [7],
geometric functional analysis [23] and computational geometry [1]. The classical
and simplest estimator of a covariance matrix is the sample covariance matrix.
Unfortunately, the spectral theory of sample covariance matrices has not
been well developed except for product distributions (or affine transformations
thereof) where one can rely on random matrix theory for matrices with
independent entries. This paper addresses the following basic question: how well
does the sample covariance matrix approximate the actual covariance matrix
in the operator norm?

We consider a mean zero random vector $X$ in a high dimensional space
$\mathbb{R}^n$ and $N$ independent copies $X_1, \ldots, X_N$ of $X$. We would
like to approximate the covariance matrix of $X$

        $\Sigma = \mathbb{E} X \otimes X = \mathbb{E} X X^T$

by the sample covariance matrix

        $\Sigma_N = \frac{1}{N} \sum_{i=1}^N X_i \otimes X_i.$

Partially supported by NSF grant FRG DMS 0918623.

Problem. Determine the minimal sample size $N = N(n, \varepsilon)$ that
guarantees with high probability (say, 0.99) that the sample covariance matrix
$\Sigma_N$ approximates the actual covariance matrix $\Sigma$ with accuracy
$\varepsilon$ in the operator norm $\ell_2 \to \ell_2$, i.e. so that

(1.1)   $\|\Sigma - \Sigma_N\| \le \varepsilon.$

The use of the operator norm in this problem allows one a good grasp
of the spectrum of $\Sigma$, as each eigenvalue of $\Sigma$ would lie within
$\varepsilon$ from the corresponding eigenvalue of $\Sigma_N$.
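As a purely numerical illustration of the quantity in (1.1), the following
sketch computes the operator-norm error for a few sample sizes and checks the
eigenvalue statement above (which is Weyl's inequality). The Gaussian sampling
model, the diagonal covariance and the helper names are assumptions made for
the illustration only; they are not taken from the paper.

```python
import numpy as np

def sample_covariance(samples):
    """Return (1/N) * sum_i X_i X_i^T for mean-zero sample points stored as rows."""
    return samples.T @ samples / samples.shape[0]

def operator_norm_error(sigma, samples):
    """Spectral-norm distance between sigma and the sample covariance."""
    return float(np.linalg.norm(sigma - sample_covariance(samples), ord=2))

rng = np.random.default_rng(0)
n = 20
variances = np.linspace(1.0, 2.0, n)   # hypothetical non-white spectrum (ascending)
sigma = np.diag(variances)

errors = {}
for N in (n, 10 * n, 100 * n):
    # Rows are N(0, sigma) vectors: rescale standard Gaussian coordinates.
    samples = rng.standard_normal((N, n)) * np.sqrt(variances)
    errors[N] = operator_norm_error(sigma, samples)

# Weyl's inequality: ordered eigenvalues move by at most the operator-norm error.
samples = rng.standard_normal((100 * n, n)) * np.sqrt(variances)
eig_gap = np.max(np.abs(np.linalg.eigvalsh(sample_covariance(samples)) - variances))
```

Here `errors` maps each sample size to the observed error, and `eig_gap` is the
largest displacement of an ordered eigenvalue for the last sample.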

It is common for today's applications to operate with an increasingly large
number of parameters $n$, and to require that sample sizes $N$ be moderate
compared with $n$. As we impose no a priori structure on the covariance matrix,
we must have $N \ge n$ for dimension reasons. Note that for some structured
covariance matrices, such as sparse or having an off diagonal decay, one can
sometimes achieve $N$ smaller than $n$ and even comparable to $\log n$, by
transforming the sample covariance matrix in order to adhere to the same
structure (e.g. by shrinkage of eigenvalues or thresholding of entries). We will
not consider structured covariance matrices in this paper; see e.g. [22] and [18].

1.2. Two examples. The most extensively studied model in random matrix
theory is where $X$ is a random vector with independent coordinates. However,
independence of coordinates cannot be justified in some important applications,
and in this paper we shall consider general random vectors. Let us
illustrate this point with two well studied examples.

Consider some non-random vectors $x_1, \ldots, x_M$ in $\mathbb{R}^n$ which
satisfy Parseval's identity (up to normalization):

(1.2)   $\frac{1}{M} \sum_{j=1}^M \langle x_j, x\rangle^2 = \|x\|_2^2$ for all $x \in \mathbb{R}^n$.

Such generalizations of orthogonal bases $(x_j)$ are called tight frames. They
arise in convex geometry via John's theorem on contact points of convex bodies
[1] and in signal processing as a convenient means to introduce redundancy
into signal representations [IT]. From a probabilistic point of view, we can
regard the normalized sum in (1.2) as the expected value of a certain random
variable. Indeed, Parseval's identity (1.2) amounts to
$\frac{1}{M} \sum_{j=1}^M x_j \otimes x_j = I$. Once we introduce a random
vector $X$ uniformly distributed in the set of $M$ points
$\{x_1, \ldots, x_M\}$, Parseval's identity will read as
$\mathbb{E} X \otimes X = I$. In other words, the covariance matrix of $X$ is
identity, $\Sigma = I$. Note that there is no reason to assume that the
coordinates of $X$ are independent.

Suppose further that the covariance matrix of $X$ can be approximated by
the sample covariance matrix $\Sigma_N$ for some moderate sample size
$N = N(n, \varepsilon)$. Such an approximation
$\|\Sigma_N - I\| \le \varepsilon$ means simply that a random subset of $N$
vectors $\{x_{j_1}, \ldots, x_{j_N}\}$ taken from the tight frame
$\{x_1, \ldots, x_M\}$ independently and with replacement is still an
approximate tight frame:

        $(1 - \varepsilon)\|x\|_2^2 \le \frac{1}{N} \sum_{i=1}^N \langle x_{j_i}, x\rangle^2 \le (1 + \varepsilon)\|x\|_2^2$ for all $x \in \mathbb{R}^n$.

In other words, a small random subset of a tight frame is still an approximate
tight frame; the size of this subset $N$ does not even depend on the frame size
$M$. For applications of results of this type in communications see [28].
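The subsampling statement above can be tried out numerically. The sketch below
is an illustration under assumptions of our own: the tight frame is built as a
union of `k` random orthonormal bases scaled to length $\sqrt{n}$ (one
convenient way to satisfy (1.2)), and the names `random_tight_frame` and
`frame_operator` are hypothetical helpers, not objects from the paper.

```python
import numpy as np

def random_tight_frame(n, k, rng):
    """Union of k random orthonormal bases of R^n, scaled to length sqrt(n).

    The M = k * n rows x_j satisfy (1/M) * sum_j x_j x_j^T = I, i.e. (1.2).
    """
    bases = [np.linalg.qr(rng.standard_normal((n, n)))[0].T for _ in range(k)]
    return np.sqrt(n) * np.vstack(bases)

def frame_operator(vectors):
    """(1/M) * sum_j x_j x_j^T for the rows x_j of `vectors`."""
    return vectors.T @ vectors / vectors.shape[0]

rng = np.random.default_rng(1)
n, k = 8, 4
frame = random_tight_frame(n, k, rng)                  # M = 32 vectors in R^8
tightness = np.linalg.norm(frame_operator(frame) - np.eye(n), ord=2)

N = 4000
subset = frame[rng.integers(0, frame.shape[0], size=N)]  # N draws with replacement
eps = np.linalg.norm(frame_operator(subset) - np.eye(n), ord=2)
```

The value `tightness` is zero up to rounding, and `eps` is the accuracy
$\varepsilon$ with which the random subset is an approximate tight frame.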

Another extensively studied class of examples is the uniform distribution on
a convex body $K$ in $\mathbb{R}^n$. A number of algorithms in computational
convex geometry (for volume computing and optimization) rely on covariance
estimation in order to put $K$ in the isotropic position, see [12], [13]. Note
that in this class of examples, the random vector uniformly distributed in $K$
typically does not have independent coordinates.

1.3. Sub-gaussian and sub-exponential distributions. Known results on
the approximation problem differ depending on the moment assumptions on
the distribution. The simplest case is when $X$ is a sub-gaussian random vector
in $\mathbb{R}^n$, thus satisfying for some $L$ that

(1.3)   $\mathbb{P}(|\langle X, x\rangle| > t) \le 2e^{-t^2/L^2}$ for $t > 0$ and $x \in S^{n-1}$.

Examples of sub-gaussian distributions with $L = O(1)$ include the standard
Gaussian random distribution in $\mathbb{R}^n$, the uniform distribution on the
cube $[-1,1]^n$, but not the uniform distribution on the unit octahedron
$\{x \in \mathbb{R}^n : |x_1| + \cdots + |x_n| \le 1\}$. For sub-gaussian
distributions in $\mathbb{R}^n$, the optimal sample size in the approximation
problem (1.1) is linear in the dimension, thus $N = O_{L,\varepsilon}(n)$. This
known fact follows from a large deviation inequality and an
$\varepsilon$-net argument, see Proposition 2.1 below.

Significant difficulties arise when one tries to extend this result to the larger
class of sub-exponential random vectors $X$, which only satisfy (1.3) with
$t^2/L^2$ replaced by $t/L$. This class is important because, as follows from
the Brunn-Minkowski inequality, the uniform distribution on every convex body
$K$ is sub-exponential provided that the covariance matrix is identity (see
[10, Section 2.2 (D3)]). For the uniform distributions on convex bodies, a
result of J. Bourgain [6] guaranteed approximation of covariance matrices with
sample size slightly larger than linear in the dimension,
$N = O_\varepsilon(n \log^3 n)$. Around the same time, a slightly better bound
$N = O_\varepsilon(n \log^2 n)$ was proved by M. Rudelson [23]. It was
subsequently improved to $N = O_\varepsilon(n \log n)$ for convex bodies
symmetric with respect to the coordinate hyperplanes by A. Giannopoulos et
al. [8], and for general convex bodies by G. Paouris [20]. Finally, an optimal
estimate $N = O_\varepsilon(n)$ was obtained by G. Aubrun [3] for convex bodies
with the symmetry assumption as above, and for general convex bodies by
R. Adamczak et al. [1]. The result in [1] is actually valid for all
sub-exponential distributions supported in a ball of radius $O(\sqrt{n})$. Thus,
if $X$ is a random vector in $\mathbb{R}^n$ that satisfies for some $K, L$ that

(1.4)   $\|X\|_2 \le K\sqrt{n}$ a.s., $\mathbb{P}(|\langle X, x\rangle| > t) \le 2e^{-t/L}$ for $t > 0$ and $x \in S^{n-1}$

then the optimal sample size is $N = O_{K,L,\varepsilon}(n)$.

The boundedness assumption $\|X\|_2 = O(\sqrt{n})$ is usually non-restrictive,
since many natural distributions satisfy this bound with overwhelming
probability. For example, the standard Gaussian random vector in $\mathbb{R}^n$
satisfies this with probability at least $1 - e^{-n}$. It follows by union bound
that for any sample size $N \ll e^n$, all independent vectors in the sample
$X_1, \ldots, X_N$ satisfy this inequality simultaneously with overwhelming
probability. Therefore, by truncation one may assume without loss of generality
that $\|X\|_2 = O(\sqrt{n})$. A similar reasoning is valid for uniform
distributions on convex bodies. In this case one can use the concentration
result of G. Paouris [20] which implies that $\|X\|_2 = O(\sqrt{n})$ with
probability at least $1 - e^{-\sqrt{n}}$.

1.4. Distributions with finite moments. Unfortunately, the class of
sub-exponential distributions is too restrictive for many natural applications.
For example, discrete distributions in $\mathbb{R}^n$ supported on fewer than
$e^{O(\sqrt{n})}$ points are usually not sub-exponential. Indeed, suppose a
random vector $X$ takes values in some set of $M$ vectors of Euclidean length
$\sqrt{n}$. Then the unit vector $x$ pointing to the most likely value of $X$
witnesses that $\mathbb{P}(|\langle X, x\rangle| = \sqrt{n}) \ge 1/M$.
It follows that in order for the random vector $X$ to be sub-exponential with
$L = O(1)$, it must be supported on a set of size $M \ge e^{c\sqrt{n}}$.
However, in applications such as (1.2) it is desirable to have a result valid
for distributions on sets of moderate sizes $M$, e.g. polynomial or even linear
in dimension $n$. This may also be desirable in modern statistical applications,
which typically operate with a large number of parameters $n$ that may not be
exponentially smaller than the population size $M$.

So far, there has been only one approximation result with very weak assumptions
on the distribution. M. Rudelson [23] showed that if a random vector $X$
in $\mathbb{R}^n$ satisfies

(1.5)   $\|X\|_2 \le K\sqrt{n}$ a.s., $\mathbb{E}\langle X, x\rangle^2 \le L^2$ for $x \in S^{n-1}$

then the minimal sample size that guarantees approximation (1.1) is
$N = O_{K,L,\varepsilon}(n \log n)$. The second moment assumption in (1.5) is
very weak; it is equivalent to the boundedness of the covariance matrix,
$\|\Sigma\| \le L^2$. The logarithmic oversampling factor is necessary in this
extremely general result, as can be seen from the example of the uniform
distribution on the set of $n$ vectors of Euclidean length $\sqrt{n}$. The
coupon collector's problem calls for the size $N \gtrsim n \log n$ in order for
the sample $\{X_1, \ldots, X_N\}$ to contain all these vectors,
which is obviously required for a nontrivial covariance approximation.
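The coupon collector obstruction can be seen in a small simulation. In the
sketch below we take the $n$ vectors to be $\sqrt{n}\,e_1, \ldots, \sqrt{n}\,e_n$
(an assumption made for concreteness; any orthogonal system of this length
behaves the same way), so that $\Sigma = I$ and $\Sigma_N$ is diagonal. The
function name and the sample sizes are hypothetical choices.

```python
import numpy as np

def coordinate_sample_error(n, N, rng):
    """Return (||Sigma_N - I||, number of unseen vectors) for X uniform on
    {sqrt(n) e_1, ..., sqrt(n) e_n}.

    Here Sigma = I and Sigma_N = diag(n * count_i / N), where count_i is the
    number of times sqrt(n) e_i appears in the sample.
    """
    counts = np.bincount(rng.integers(0, n, size=N), minlength=n)
    error = float(np.max(np.abs(n * counts / N - 1.0)))
    return error, int(np.sum(counts == 0))

rng = np.random.default_rng(2)
n = 200
err_linear, missed_linear = coordinate_sample_error(n, n, rng)      # N = n
N_log = int(50 * n * np.log(n))                                     # N ~ n log n
err_log, missed_log = coordinate_sample_error(n, N_log, rng)
```

Whenever some vector is missed, the corresponding diagonal entry of
$\Sigma_N$ vanishes, so the error is at least 1; this is what happens for
$N = n$, while a sample of order $n \log n$ sees every vector.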

There is clearly a big gap between the sub-exponential assumption (1.4),
where the optimal size is $N \sim n$, and the weakest second moment assumption
(1.5), where the optimal size is $N \sim n \log n$. It would be useful to
classify the distributions for which the logarithmic oversampling is needed.
The picture is far from complete - the uniform distributions on convex bodies
in $\mathbb{R}^n$ for which we now know that the logarithmic oversampling is not
needed are very far from the uniform distributions on $O(n)$ points for which
the logarithmic oversampling is needed. We conjecture that the logarithmic
oversampling is not needed for all distributions with finite $q$-th moment with
appropriate absolute constant $q$; probably $q = 4$ suffices or even any
$q > 2$. We will thus assume that

(1.6)   $\|X\|_2 \le K\sqrt{n}$ a.s., $\mathbb{E}|\langle X, x\rangle|^q \le L^q$ for $x \in S^{n-1}$.

Conjecture 1.1. Let $X$ be a random vector in $\mathbb{R}^n$ that satisfies the
moment assumption (1.6) for some appropriate absolute constant $q$ and some
$K, L$. Let $\varepsilon > 0$. Then, with high probability, the sample size
$N \gtrsim_{K,L,\varepsilon} n$ suffices to approximate the covariance matrix
$\Sigma$ of $X$ by the sample covariance matrix $\Sigma_N$ in the operator norm:
$\|\Sigma - \Sigma_N\| \le \varepsilon$.

In this paper we prove the Conjecture up to an iterated logarithmic factor.

Theorem 1.2. Consider a random vector $X$ in $\mathbb{R}^n$ ($n \ge 4$) which
satisfies moment assumptions (1.6) for some $q > 4$ and some $K, L$. Let
$\delta > 0$. Then, with probability at least $1 - \delta$, the covariance
matrix $\Sigma$ of $X$ can be approximated by the sample covariance matrix
$\Sigma_N$ as

        $\|\Sigma - \Sigma_N\| \lesssim_{q,K,L,\delta} (\log\log n)^2 \Big(\frac{n}{N}\Big)^{\frac{1}{2} - \frac{2}{q}}.$

Remarks. 1. The notation $a \lesssim_{q,K,L,\delta} b$ means that
$a \le C(q,K,L,\delta)\, b$ where $C(q,K,L,\delta)$ depends only on the
parameters $q, K, L, \delta$; see Section 2.3 for more notation. The logarithms
are to the base 2. We put the restriction $n \ge 4$ only to ensure that
$\log\log n \ge 1$; Theorem 1.2 and other results below clearly hold for
dimensions $n = 1, 2, 3$ even without the iterated logarithmic factors.

2. It follows that for every $\varepsilon > 0$, the desired approximation
$\|\Sigma - \Sigma_N\| \le \varepsilon$ is guaranteed if the sample has size

        $N \gtrsim_{q,K,L,\delta,\varepsilon} (\log\log n)^p\, n$ where $\frac{1}{p} + \frac{1}{q} = \frac{1}{4}$.

3. A similar result holds for independent random vectors $X_1, \ldots, X_N$ that
are not necessarily identically distributed; we will prove this general result in
Theorem 6.1.

4. The boundedness assumption $\|X\|_2 \le K\sqrt{n}$ in (1.6) can often be
weakened or even dropped by a simple modification of Theorem 1.2. This happens,
for example, if $\max_{i \le N} \|X_i\|_2 = O(\sqrt{n})$ holds with high
probability, as one can apply Theorem 1.2 conditionally on this event. We refer
the reader to a thorough discussion of the boundedness assumption in Section 1.3
of [30].

1.5. Extreme eigenvalues of sample covariance matrices. Theorem 1.2
can be used to analyze the spectrum of sample covariance matrices $\Sigma_N$.
The case when the random vector $X$ has i.i.d. coordinates is most studied in
random matrix theory. Suppose that both $N, n \to \infty$ while the aspect ratio
$n/N \to \beta \in (0,1]$. If the coordinates of $X$ have unit variance and
finite fourth moment, then clearly $\Sigma = I$. The largest eigenvalue
$\lambda_1(\Sigma_N)$ then converges a.s. to $(1 + \sqrt{\beta})^2$, and the
smallest eigenvalue $\lambda_n(\Sigma_N)$ converges a.s. to
$(1 - \sqrt{\beta})^2$, see [5]. For more on the extreme eigenvalues in both
asymptotic regime ($N, n \to \infty$) and non-asymptotic regime ($N, n$ fixed),
see [24].
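The limits $(1 \pm \sqrt{\beta})^2$ are already visible at moderate sizes. The
following sketch is a numerical illustration only; the Gaussian entries and the
particular sizes `n = 400`, `N = 1600` are assumptions chosen so that
$\beta = 1/4$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 400, 1600                      # aspect ratio beta = n / N = 1/4
beta = n / N

# Columns are N independent vectors with i.i.d. standard Gaussian coordinates
# (an illustrative choice with mean zero, unit variance, finite fourth moment).
A = rng.standard_normal((n, N))
eigenvalues = np.linalg.eigvalsh(A @ A.T / N)   # spectrum of Sigma_N, ascending

lambda_min, lambda_max = eigenvalues[0], eigenvalues[-1]
predicted_min = (1 - np.sqrt(beta)) ** 2        # limiting smallest eigenvalue
predicted_max = (1 + np.sqrt(beta)) ** 2        # limiting largest eigenvalue
```

Comparing `lambda_min` and `lambda_max` with the two predicted values gives a
quick sanity check of the asymptotic statement for a single random draw.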

Without independence of the coordinates, analyzing the spectrum of sample
covariance matrices $\Sigma_N$ becomes significantly harder. Suppose that
$\Sigma = I$. For sub-exponential distributions, i.e. those satisfying (1.4),
it was proved in [2] that

        $1 - O(\sqrt{\beta}) \le \lambda_n(\Sigma_N) \le \lambda_1(\Sigma_N) \le 1 + O(\sqrt{\beta}).$

(A weaker version with extra $\log(1/\beta)$ factors was proved earlier by the
same authors in [1].) Under only finite moment assumption (1.6), Theorem 1.2
clearly yields

        $1 - O\big((\log\log n)^2\big)\, \beta^{\frac{1}{2} - \frac{2}{q}} \le \lambda_n(\Sigma_N) \le \lambda_1(\Sigma_N) \le 1 + O\big((\log\log n)^2\big)\, \beta^{\frac{1}{2} - \frac{2}{q}}.$

Note that for large exponents $q$, the factor
$\beta^{\frac{1}{2} - \frac{2}{q}}$ becomes close to $\sqrt{\beta}$.

1.6. Norms of random matrices with independent columns. One can
interpret the results of this paper in terms of random matrices with
independent columns. Indeed, consider an $n \times N$ random matrix
$A = [X_1, \ldots, X_N]$ whose columns $X_1, \ldots, X_N$ are drawn
independently from some distribution on $\mathbb{R}^n$. The sample covariance
matrix of this distribution is simply $\Sigma_N = \frac{1}{N} A A^T$,
so the eigenvalues of $N^{1/2} \Sigma_N^{1/2}$ are the singular values of $A$.
In particular, under the same finite moment assumptions as in Theorem 1.2 we
obtain the bound on the operator norm

(1.7)   $\|A\| \lesssim_{q,K,L,\delta} \log\log N \cdot (\sqrt{n} + \sqrt{N}).$

This follows from a result leading to Theorem 1.2, see Corollary 5.2. The
bound is optimal up to the $\log\log N$ factor for matrices with i.i.d. entries,
because the operator norm is bounded below by the Euclidean norm of any
column and any row. For random matrices with independent entries, estimate
(1.7) follows (under the fourth moment assumption) from more general bounds
by Seginer [26] and Latala [15], and even without the $\log\log N$ factor.
Without independence of entries, this bound was proved by the author [29] for
products of random matrices with independent entries and deterministic matrices,
and also without the $\log\log N$ factor.
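The dictionary between $\Sigma_N$ and the matrix $A$ used in this subsection
can be checked directly. The sketch below is an illustration with Gaussian
columns (an assumption made for convenience); it verifies that the eigenvalues
of $N^{1/2}\Sigma_N^{1/2}$ are the singular values of $A$ and that the operator
norm dominates every column and row norm.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 30, 120
A = rng.standard_normal((n, N))                  # columns X_1, ..., X_N (illustrative)
sigma_N = A @ A.T / N                            # sample covariance (1/N) A A^T

singular_values = np.linalg.svd(A, compute_uv=False)          # descending order
root_eigs = np.sqrt(N * np.linalg.eigvalsh(sigma_N))[::-1]    # descending order

op_norm = singular_values[0]                     # operator norm of A
max_column = np.max(np.linalg.norm(A, axis=0))   # largest Euclidean column norm
max_row = np.max(np.linalg.norm(A, axis=1))      # largest Euclidean row norm
```

For i.i.d. entries the column norms are of order $\sqrt{n}$ and the row norms
of order $\sqrt{N}$, which is the source of the lower bound mentioned above.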

1.7. Organization of the rest of the paper. In the beginning of Section 2
we outline the heuristics of our argument. We emphasize its two main
ingredients - structure of divergent series and a decoupling principle. We
finish that section with some preliminary material - notation (Section 2.3), a
known argument that solves the approximation problem for sub-gaussian
distributions (Section 2.4), and the previous weaker result of the author [30]
on the approximation problem in the weak $\ell_2$ norm (Section 2.5).

The heart of the paper is Sections 3 and 4. In Section 3 we study the
structure of series that diverge faster than the iterated logarithm. This
structure is used in Section 4 to deduce a decoupling principle. In Section 5 we
apply the decoupling principle to norms of random matrices. Specifically, in
Theorem 5.1 we estimate the norm of $\sum_{i \in E} X_i \otimes X_i$ uniformly
over subsets $E$. We interpret this in Corollary 5.2 as a norm estimate for
random matrices with independent columns. In Section 6 we deduce the general
form of our main result on approximation of covariance matrices, Theorem 6.1.

Acknowledgement. The author is grateful to the referee for useful sugges- 
tions. 

2. Outline of the method and preliminaries 

Let us now outline the two main ingredients of our method, which are a new
structure theorem for divergent series and a new decoupling principle. For the
sake of simplicity in this discussion, we shall now concentrate on proving the
weaker upper bound $\|\Sigma_N\| = O(1)$ in the case $N = n$. Once this simpler
case is understood, the full Theorem 1.2 will require a little extra effort
using a now standard truncation argument due to J. Bourgain [6]. We thus
consider independent copies $X_1, \ldots, X_n$ of a random vector $X$ in
$\mathbb{R}^n$ satisfying the finite moment assumptions (1.6). We would like to
show with high probability that

        $\|\Sigma_n\| = \sup_{x \in S^{n-1}} \frac{1}{n} \sum_{i=1}^n \langle X_i, x\rangle^2 = O(1).$

In this expression we may recognize a stochastic process indexed by vectors $x$
on the sphere. For each fixed $x$, we have to control the sum of independent
random variables $\sum_i \langle X_i, x\rangle^2$ with finite moments. Suppose
the bad event occurs - for some $x$, this sum is significantly larger than $n$.
Unfortunately, because of the heavy tails of these random variables, the bad
event may occur with polynomial rather than exponential probability
$n^{-O(1)}$. This is too weak to control these sums for all $x$ simultaneously
on the $n$-dimensional sphere, where $\varepsilon$-nets have exponential sizes
in $n$. So, instead of working with sums of independent random variables, we try
to locate some structure in the summands responsible for the largeness of the
sum.

2.1. Structure of divergent series. More generally, we shall study the
structure of divergent series $\sum_i b_i = \infty$, where $b_i \ge 0$. Let us
first suppose that the series diverges faster than the logarithmic function,
thus

        $\sum_{i=1}^n b_i \gg \log n$ for some $n \ge 2$.

Comparing with the harmonic series we see that the non-increasing rearrangement
$b_i^*$ of the coefficients at some point must be large:

        $b^*_{n_1} \gg 1/n_1$ for some $n_1 \le n$.

In other words, one can find $n_1$ large terms of the sum: there exists an index
set $I \subseteq [n]$ of size $|I| = n_1$ and such that $b_i \gg 1/n_1$ for
$i \in I$. This collection of large terms $(b_i)_{i \in I}$ forms a desired
structure responsible for the largeness of the series $\sum_i b_i$. Such a
structure is well suited to our applications where $b_i$ are independent random
variables, $b_i = \langle X_i, x\rangle^2/n$. Indeed, the events
$\{b_i \gg 1/n_1\}$ are independent, and the probability of each such event is
easily controlled by finite moment assumptions (2.2) through Markov's
inequality. This line was developed in [30], but it clearly leads to a loss of a
logarithmic factor which we are trying to avoid in the present paper.
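The comparison with the harmonic series can be made fully explicit for a finite
sequence. The sketch below is our own illustration, with hypothetical names: it
picks the size $n_1$ maximizing $i \cdot b_i^*$ and returns the corresponding
set of large terms. If $b_i^* < c/i$ held for every $i$, the sum would be less
than $c H_n$ with $H_n = 1 + 1/2 + \cdots + 1/n$, so every selected term is at
least $\sum_i b_i / (H_n n_1)$.

```python
import numpy as np

def large_terms(b):
    """Return (n1, I): a size n1 and the indices of the n1 largest terms of b.

    n1 maximizes i * b*_i, where b* is the non-increasing rearrangement of b.
    By comparison with the harmonic series, n1 * b*_{n1} >= sum(b) / H_n.
    """
    b = np.asarray(b, dtype=float)
    order = np.argsort(-b)                    # indices sorted by decreasing b_i
    ranks = np.arange(1, b.size + 1)
    n1 = int(np.argmax(ranks * b[order])) + 1
    return n1, order[:n1]

rng = np.random.default_rng(5)
b = rng.pareto(1.0, size=1000)                # heavy-tailed illustrative terms
n1, I = large_terms(b)
harmonic = np.sum(1.0 / np.arange(1, b.size + 1))
threshold = b.sum() / (harmonic * n1)         # guaranteed lower bound on b_i, i in I
```

The index set `I` plays the role of the structure responsible for the largeness
of the sum; since $H_n$ is of order $\log n$, a sum much larger than $\log n$
forces `threshold` to be much larger than $1/n_1$.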

We will work on the next level of precision, thus studying the structure of
series that diverge slower than the logarithmic function but faster than the
iterated logarithm. So let us assume that

        $b_i^* \le 1/i$ for all $i$;   $\sum_{i=1}^n b_i \gg \log\log n$ for some $n \ge 4$.

In Proposition 3.1 we will locate almost the same structure as we had for
logarithmically divergent series, except up to some factor
$\log\log n \ll l \le \log n$, as follows. For some $n_1 \le n$ there exists an
index set $I \subseteq [n]$ of size $|I| = n_1$, such that

        $b_i \gg \frac{1}{l n_1}$ for $i \in I$, and moreover $\frac{n}{n_1} \ge 2^{l/2}$.

2.2. Decoupling. The structure that we found is well suited to our application
where $b_i$ are independent random variables $b_i = \langle X_i, x\rangle^2/n$.
In this case we have

(2.1)   $\langle X_i, x\rangle^2 \gg \frac{n}{l n_1} \gtrsim \frac{n}{n_1} \Big/ \log\Big(\frac{n}{n_1}\Big) \ge \Big(\frac{n}{n_1}\Big)^{1 - o(1)}$ for $i \in I$.

The probability that this happens is again easy to control using independence
of $\langle X_i, x\rangle$ for fixed $x$, finite moment assumptions (2.2) and
Markov's inequality. Since there are $\binom{n}{n_1}$ ways to choose the subset
$I$, the probability of the event in (2.1) is bounded by

where the last inequality follows because $\binom{n}{n_1} \le (en/n_1)^{n_1}$
and since $q > 2$.

Our next task is to unfix $x \in S^{n-1}$. The exponential probability estimate
we obtained allows us to take the union bound over all $x$ in the unit sphere of
any fixed $n_1$-dimensional subspace, since this sphere has an
$\varepsilon$-net of size exponential in $n_1$. We can indeed assume without
loss of generality that the vector $x$ in our structural event (2.1) lies in the
span of $(X_i)_{i \in I}$ which is $n_1$-dimensional; this can be done by
projecting $x$ onto this span if necessary. Unfortunately, this obviously makes
$x$ depend on the random vectors $(X_i)_{i \in I}$ and destroys the
independence of random variables $\langle X_i, x\rangle$. This hurdle calls for
a decoupling mechanism, which would make $x$ in the structural event (2.1)
depend on some small fraction of the vectors $(X_i)_{i \in I}$. One would then
condition on this fraction of random vectors and use the structural event (2.1)
for the other half, which would quickly lead to completion of the argument.

Our decoupling principle, Proposition 4.1, is a deterministic statement that
works for fixed vectors $X_i$. Loosely speaking, we assume that the structural
event (2.1) holds for some $x$ in the span of $(X_i)_{i \in I}$, and we would
like to force $x$ to lie in the span of a small fraction of these $X_i$. We
write $x$ as a linear combination $x = \sum_{i \in I} c_i X_i$. The first step
of decoupling is to remove the "diagonal" term $c_i X_i$ from this sum, while
retaining the largeness of $\langle X_i, x\rangle$. This task turns out to be
somewhat difficult, and it will force us to refine our structural result for
divergent series by adding a domination ingredient into it. This will be done at
the cost of another $\log\log n$ factor. After the diagonal term is removed, the
number of terms in the sum for $x$ will be reduced by a probabilistic selection
using Maurey's empirical method.

2.3. Notation and preliminaries. We will use the following notation throughout
this paper. $C$ and $c$ will stand for positive absolute constants; $C_p$ will
denote a quantity which only depends on the parameter $p$, and similar notation
will be used with more than one parameter. For positive numbers $a$ and $b$, the
asymptotic inequality $a \lesssim b$ means that $a \le Cb$. Similarly,
inequalities of the form $a \lesssim_{p,q} b$ mean that $a \le C_{p,q}\, b$.
Intervals of integers will be denoted by $[n] := \{1, \ldots, \lceil n \rceil\}$
for $n \ge 0$. The cardinality of a finite set $I$ is denoted by $|I|$. All
logarithms will be to the base 2.

The non-increasing rearrangement of a finite or infinite sequence of numbers
$a = (a_i)$ will be denoted by $(a_i^*)$. Recall that the $\ell_p$ norm is
defined as $\|a\|_p = (\sum_i |a_i|^p)^{1/p}$ for $1 \le p < \infty$, and
$\|a\|_\infty = \max_i |a_i|$. We will also consider the weak $\ell_p$ norm for
$1 \le p < \infty$, which is defined as the infimum of positive numbers $M$ for
which the non-increasing rearrangement $(|a|_i^*)$ of the sequence $(|a_i|)$
satisfies $|a|_i^* \le M i^{-1/p}$ for all $i$. For sequences of finite length
$n$, it follows from the definition that the weak $\ell_p$ norm is equivalent to
the $\ell_p$ norm up to an $O(\log n)$ factor, thus
$\|a\|_{p,\infty} \le \|a\|_p \lesssim \log n \cdot \|a\|_{p,\infty}$ for
$a \in \mathbb{R}^n$.
In this paper we deal with the $\ell_2 \to \ell_2$ operator norm of $n \times n$
matrices $\|A\|$, also known as spectral norm. By definition,

        $\|A\| = \sup_{x \in S^{n-1}} \|Ax\|_2$

where $S^{n-1}$ denotes the unit Euclidean sphere in $\mathbb{R}^n$.
Equivalently, $\|A\|$ is the largest singular value of $A$ and the largest
eigenvalue of $\sqrt{AA^T}$. We will frequently use that for Hermitian matrices
$A$ one has

        $\|A\| = \sup_{x \in S^{n-1}} |\langle Ax, x\rangle|.$



It will be convenient to work in a slightly more general setting than in
Theorem 1.2 and consider independent random vectors $X_i$ in $\mathbb{R}^n$ that
are not necessarily identically distributed. All we need is that moment
assumptions (1.6) hold uniformly for all vectors:

(2.2)   $\|X_i\|_2 \le K\sqrt{n}$ a.s., $(\mathbb{E}|\langle X_i, x\rangle|^q)^{1/q} \le L$ for all $x \in S^{n-1}$.

We can view our goal as establishing a law of large numbers in the operator
norm, and with quantitative estimates on convergence. Thus we would like to
show that the approximation error

(2.3)   $\Big\| \frac{1}{N} \sum_{i=1}^N X_i \otimes X_i - \mathbb{E} X_i \otimes X_i \Big\| = \sup_{x \in S^{n-1}} \Big| \frac{1}{N} \sum_{i=1}^N \langle X_i, x\rangle^2 - \mathbb{E}\langle X_i, x\rangle^2 \Big|$

is small like in Theorem 1.2.

2.4. Sub-gaussian distributions. A solution to the approximation problem
is well known and easy for sub-gaussian random vectors, those satisfying (1.3).
The optimal sample size here is proportional to the dimension, thus
$N = O_{L,\varepsilon}(n)$. For the reader's convenience, we recall and prove a
general form of this result.

Proposition 2.1 (Sub-gaussian distributions). Consider independent random
vectors $X_1, \ldots, X_N$ in $\mathbb{R}^n$, $N \ge n$, which have sub-gaussian
distribution as in (1.3) for some $L$. Then for every $\delta > 0$ with
probability at least $1 - \delta$ one has

        $\Big\| \frac{1}{N} \sum_{i=1}^N X_i \otimes X_i - \mathbb{E} X_i \otimes X_i \Big\| \lesssim_{L,\delta} \Big(\frac{n}{N}\Big)^{\frac{1}{2}}.$

One should compare this with our main result, Theorem 1.2, which yields
almost the same conclusion under only finite moment assumptions on the
distribution, except for an iterated logarithmic factor and a slight loss of the
exponent 1/2 (the latter may be inevitable when dealing with finite moments).

The well known proof of Proposition 2.1 is based on Bernstein's deviation
inequality for independent random variables and an $\varepsilon$-net argument.
The latter allows one to replace the sphere $S^{n-1}$ in the computation of the
norm in (2.3) by a finite $\varepsilon$-net as follows.

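Lemma 2.2 (Computing norms on $\varepsilon$-nets). Let $A$ be a Hermitian
$n \times n$ matrix, and let $\mathcal{N}_\varepsilon$ be an $\varepsilon$-net
of the unit Euclidean sphere $S^{n-1}$ for some $\varepsilon \in [0, 1/2)$.
Then

        $\|A\| = \sup_{x \in S^{n-1}} |\langle Ax, x\rangle| \le (1 - 2\varepsilon)^{-1} \sup_{x \in \mathcal{N}_\varepsilon} |\langle Ax, x\rangle|.$

Proof. Let us choose $x \in S^{n-1}$ for which $\|A\| = |\langle Ax, x\rangle|$,
and choose $y \in \mathcal{N}_\varepsilon$ which approximates $x$ as
$\|x - y\|_2 \le \varepsilon$. It follows by the triangle inequality that

        $|\langle Ax, x\rangle - \langle Ay, y\rangle| = |\langle Ax, x - y\rangle + \langle A(x - y), y\rangle|$
        $\le \|A\| \|x\|_2 \|x - y\|_2 + \|A\| \|x - y\|_2 \|y\|_2 \le 2\|A\|\varepsilon.$

It follows that

        $|\langle Ay, y\rangle| \ge |\langle Ax, x\rangle| - 2\|A\|\varepsilon = (1 - 2\varepsilon)\|A\|.$

This completes the proof. □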
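The net inequality with the factor $(1 - 2\varepsilon)^{-1}$ can be tested in
the plane, where nets of the circle are explicit. The sketch below is an
illustration under our own choices: $k$ equally spaced points on $S^1$ form an
$\varepsilon$-net with $\varepsilon = 2\sin(\pi/(2k))$, and the random symmetric
matrix and the name `circle_net` are hypothetical.

```python
import numpy as np

def circle_net(k):
    """Return k equally spaced points on the unit circle and the net parameter.

    Every unit vector is within angle pi/k of some net point, so the points
    form an eps-net with eps = 2 * sin(pi / (2k)) (the chord length).
    """
    angles = 2 * np.pi * np.arange(k) / k
    points = np.column_stack([np.cos(angles), np.sin(angles)])
    return points, 2 * np.sin(np.pi / (2 * k))

rng = np.random.default_rng(7)
G = rng.standard_normal((2, 2))
A = (G + G.T) / 2                              # random symmetric 2 x 2 matrix

net, eps = circle_net(16)                      # eps is below 1/2
# Quadratic forms <Ay, y> for every net point y, then the supremum over the net.
net_sup = np.max(np.abs(np.einsum("ij,jk,ik->i", net, A, net)))
op_norm = np.linalg.norm(A, ord=2)
```

The supremum over the net never exceeds the norm, and dividing it by
$1 - 2\varepsilon$ recovers an upper bound on the norm.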



Proof of Proposition 2.1. Without loss of generality, we can assume that in
the sub-gaussian assumption (1.3) we have $L = 1$ by replacing $X_i$ by
$X_i/L$. Identity (2.3) expresses the norm in question as a supremum over the
unit sphere $S^{n-1}$. Next, Lemma 2.2 allows one to replace the sphere in (2.3)
by its 1/4-net $\mathcal{N}$ at the cost of the absolute constant factor
$(1 - 2 \cdot \frac{1}{4})^{-1} = 2$. Moreover, we can arrange so that the net
has size $|\mathcal{N}| \le 9^n$; this follows by a standard volumetric argument
(see [17, Lemma 9.5]).

Let us fix $x \in \mathcal{N}$. The sub-gaussian assumption on $X_i$ implies
that the random variables $\langle X_i, x\rangle^2$ are sub-exponential:
$\mathbb{P}(\langle X_i, x\rangle^2 > t) \le 2e^{-t}$ for $t > 0$. Bernstein's
deviation inequality for independent sub-exponential random variables (see e.g.
[27, Section 2.2.2]) yields for all $\varepsilon > 0$ that

(2.4)   $\mathbb{P}\Big\{ \Big| \frac{1}{N} \sum_{i=1}^N \langle X_i, x\rangle^2 - \mathbb{E}\langle X_i, x\rangle^2 \Big| > \varepsilon \Big\} \le 2e^{-c\varepsilon^2 N}.$

Now we unfix $x$. Using (2.4) for each $x$ in the net $\mathcal{N}$, we conclude
by the union bound that the event

        $\Big| \frac{1}{N} \sum_{i=1}^N \langle X_i, x\rangle^2 - \mathbb{E}\langle X_i, x\rangle^2 \Big| \le \varepsilon$ for all $x \in \mathcal{N}$

holds with probability at least

        $1 - |\mathcal{N}| \cdot 2e^{-c\varepsilon^2 N} \ge 1 - 2e^{3n - c\varepsilon^2 N}.$

Now if we choose $\varepsilon^2 = (4/c) \log(2/\delta)\, n/N$, this probability
is further bounded below by $1 - \delta$ as required: since
$\log(2/\delta) \ge 1$ for $\delta \le 1$, the exponent $3n - 4n\log(2/\delta)$
is at most $-n\log(2/\delta) \le -\ln(2/\delta)$. By the reduction from the
sphere to the net mentioned in the beginning of the argument, this completes the
proof. □

2.5. Results in the weak $\ell_2$ norm, and almost orthogonality of $X_i$. A
truncation argument of J. Bourgain [6] reduces the approximation problem to
finding an upper bound on

        $\Big\| \sum_{i \in E} X_i \otimes X_i \Big\| = \sup_{x \in S^{n-1}} \sum_{i \in E} \langle X_i, x\rangle^2 = \sup_{x \in S^{n-1}} \|(\langle X_i, x\rangle)_{i \in E}\|_2^2$

uniformly for all index sets $E \subseteq [N]$ with given size. A weaker form of
this problem, with the weak $\ell_2$ norm of the sequence
$(\langle X_i, x\rangle)$ instead of its $\ell_2$ norm, was studied in [30]. The
following bound was proved there:

Theorem 2.3 ([30] Theorem 3.1). Consider random vectors $X_1, \ldots, X_N$ which
satisfy moment assumptions (2.2) for some $q > 4$ and some $K, L$. Then, for
every t > 1, with probability at least 1 — Ct~ Q ^ q one has 



sup \\{(Xi,x)) ieE \\l < q ,K,L n + t 2 \E\ forallEC[N}. 



For most of our argument (through decoupling), we treat $X_i$ as fixed
non-random vectors. The only property we require from $X_i$ is that they are
almost pairwise orthogonal. For random vectors, almost pairwise orthogonality
easily follows from the moment assumptions (2.2):

Lemma 2.4 ([30] Lemma 3.3). Consider random vectors $X_1, \ldots, X_N$ which
satisfy moment assumptions (2.2) for some $q > 4$ and some $K, L$. Then, for
every $t \ge 1$, with probability at least $1 - Ct^{-q}$ one has

( 2 - 5 ) W\ S ( X - X *> 2 ^K,L t 2 (^X' q n for all E C [N], k G [N]. 

3. Structure of divergent series 

In this section we study the structure of series which diverge slower than the 
logarithmic function but faster than an iterated logarithm. This is summarized 
in the following result. 

Proposition 3.1 (Structure of divergent series). Let $\alpha \in (0,1)$.
Consider a vector $b = (b_1, \ldots, b_m) \in \mathbb{R}^m$ ($m \ge 4$) that
satisfies

(3.1)   $\|b\|_{1,\infty} \le 1, \quad \|b\|_1 \gtrsim_\alpha (\log\log m)^2.$

Then there exist a positive integer $l \le \log m$ and a subset of indices
$I_1 \subseteq [m]$ such that the following holds. Given a vector
$\lambda = (\lambda_i)_{i \in I_1}$ such that $\|\lambda\|_1 \le 1$, one can
find a further subset $I_2 \subseteq I_1$ with the following two properties.
(i) (Regularity): the sizes $n_1 := |I_1|$ and $n_2 := |I_2|$ satisfy

        $2^{l/2} \le \frac{m}{n_1} \le \frac{m}{n_2} \le \Big(\frac{m}{n_1}\Big)^{1+\alpha}.$

(ii) (Largeness of coefficients):

        $|b_i| \ge \frac{1}{l n_1}$ for $i \in I_1$;

        $|b_i| \ge \frac{1}{l n_2}$ and $|b_i| \ge 2|\lambda_i|$ for $i \in I_2$.

Furthermore, we can make $l \ge C_\alpha \log\log m$ with arbitrarily large
$C_\alpha$ by making the dependence on $\alpha$ implicit in the assumption (3.1)
sufficiently large.

Remarks. 1. Proposition 3.1 is somewhat nontrivial even if one ignores the
vector $\lambda$ and the further subset $I_2$. In this simpler form the result
was introduced informally in Section 2.1. The structure that we find is located
in the coefficients $b_i$ on the index set $I_1$. Note that the largeness
condition (ii) for these coefficients is easy to prove if we disregard the
regularity condition (i). Indeed, since
$\|b\|_{1,\infty} \gtrsim (\log m)^{-1} \|b\|_1 \gg 1/\log m$, we can choose
$l = \log m$ and obtain a set $I_1$ satisfying (ii) by the definition of the
weak $\ell_1$ norm. But the regularity condition (i) guarantees the smaller
level

        $l \lesssim \log(m/n_1)$

which will be crucial in our application to decoupling.

2. The freedom to choose $\lambda$ in Proposition 3.1 ensures that the structure
located in the set $I_1$ is in a sense hereditary; it can pass to subsets $I_2$.
The domination of $\lambda$ by $b$ on $I_2$ will be crucial in the removal of
the diagonal terms in our application to decoupling.

We now turn to the proof of Proposition 3.1. Heuristically, we will first find
many (namely, $l$) sets $I_1$ on which the coefficients are large as in (ii),
then choose one that satisfies the regularity condition (i). This regularization
step will rely on the following elementary lemma.

Lemma 3.2 (Regularization). Let $L$ be a positive integer. Consider a nonempty
subset $J \subseteq [L]$ with size $l := |J|$. Then, for every $\alpha \in (0,1)$, there exist
elements $j_1, j_2 \in J$ that satisfy the following two properties.
(i) (Regularity):

$l/2 \le j_1 \le j_2 \le (1+\alpha) j_1.$

(ii) (Density):

$|J \cap [j_1, j_2]| \gtrsim_\alpha \frac{l}{\log(2L/l)}.$

Proof. We will find $j_1, j_2$ as some consecutive terms of the following geometric
progression. Define $j^{(0)} \in J$ to be the (unique) element such that

$|J \cap [1, j^{(0)}]| = \lceil l/2 \rceil$, and let $j^{(k)} := (1+\alpha) j^{(k-1)}$, $k = 1, 2, \ldots$

We will only need to consider $K$ terms of this progression, where
$K := \min\{k : j^{(k)} \ge L\}$. Since $j^{(0)} \ge \lceil l/2 \rceil \ge l/2$, we have
$j^{(k)} \ge (1+\alpha)^k j^{(0)} \ge (1+\alpha)^k l/2$. On the other hand, $j^{(K-1)} \le L$.
It follows that $K \lesssim_\alpha \log(2L/l)$.

We claim that there exists a term $1 \le k \le K$ such that

(3.2)    $|J \cap [j^{(k-1)}, j^{(k)}]| \ge \frac{l}{3K} \gtrsim_\alpha \frac{l}{\log(2L/l)}.$

Indeed, otherwise we would have

$l = |J| \le |J \cap [1, j^{(0)}]| + \sum_{k=1}^{K} |J \cap [j^{(k-1)}, j^{(k)}]| < \lceil l/2 \rceil + K \cdot \frac{l}{3K},$

which is impossible.

The terms $j_1 := j^{(k-1)}$ and $j_2 := j^{(k)}$ for which (3.2) holds clearly satisfy (i)
and (ii) of the conclusion. By increasing $j_1$ and decreasing $j_2$ if necessary we
can assume that $j_1, j_2 \in J$. This completes the proof. □
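The search in this proof is constructive, and a minimal Python sketch may make it concrete. The function name `regularize` and the decision to scan every interval of the progression and keep the densest one are assumptions made for illustration; the proof itself only needs the existence of one sufficiently dense interval.

```python
import math

def regularize(J, L, alpha):
    """Return (j1, j2, count) with j1, j2 in J, following the proof of Lemma 3.2.

    J is a nonempty set of integers in [1, L]; 0 < alpha < 1.
    """
    J = sorted(set(J))
    l = len(J)
    assert l > 0 and J[0] >= 1 and J[-1] <= L and 0 < alpha < 1
    # j^(0): the element of J with exactly ceil(l/2) elements of J in [1, j^(0)]
    j0 = J[math.ceil(l / 2) - 1]
    best = None
    lo = float(j0)
    while True:
        hi = (1 + alpha) * lo              # next term of the geometric progression
        members = [j for j in J if lo <= j <= hi]
        if best is None or len(members) > len(best):
            best = members                 # keep the densest interval seen so far
        if hi >= L:                        # the progression has reached L
            break
        lo = hi
    # shrink the interval to elements of J: j1 is increased, j2 is decreased
    return best[0], best[-1], len(best)
```

Because the first interval always contains $j^{(0)}$, the returned interval is never empty, and shrinking its endpoints to elements of $J$ preserves both $l/2 \le j_1$ and $j_2 \le (1+\alpha) j_1$.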

Proof of Proposition 3.1. We shall prove the following slightly stronger
statement. Consider a sufficiently large number

$K \gtrsim_\alpha \log\log m.$

Assume that the vector $b$ satisfies

$\|b\|_{1,\infty} \le 1, \qquad \|b\|_1 \ge K \log\log m.$

We shall prove that the conclusion of the Proposition holds with (ii) replaced
by:

(3.3)    $|b_i| \ge \frac{K}{2 l n_1}$ for $i \in I_1$;

(3.4)    $|b_i| \ge \frac{K}{2 l n_2} \ge 2|\lambda_i|$ for $i \in I_2$.

We will construct $I_1$ and $I_2$ in the following way. First we decompose the
index set $[m]$ into blocks $\Omega_1, \ldots, \Omega_L$ on which the coefficients have similar
magnitude; this is possible with $L \sim \log m$ blocks. Using the assumption
$\|b\|_1 \gtrsim (\log\log m)^2$, one easily checks that many (at least $l \sim \log\log m$) of
the blocks $\Omega_j$ have large contribution (at least $1/j$) to the sum $\|b\|_1 = \sum_i |b_i|$.
We will only focus on such large blocks in the rest of the argument. At this
point, the union of these blocks could be declared $I_1$. We indeed proceed this
way, except we first use Regularization Lemma 3.2 on these blocks in order
to obtain the required regularity property (i). Finally, assume we are given
coefficients $(\lambda_i)_{i \in I_1}$ with small sum $\sum_i |\lambda_i| \le 1$ as in the assumption. Since
the coefficients $b_i$ are large on $I_1$ by construction, the pigeonhole principle will
yield (loosely speaking) a whole block of coefficients $\Omega_j$ where $b_i$ will dominate
as required, $|b_i| \ge 2|\lambda_i|$. We declare this block $I_2$ and complete the proof. Now
we pass to the details of the argument.

Step 1: decomposition of $[m]$ into blocks. Without loss of generality,

$\frac{1}{m} \le b_i \le 1, \qquad \lambda_i \ge 0$ for all $i \in [m]$.

Indeed, we can clearly assume that $b_i \ge 0$ and $\lambda_i \ge 0$. The estimate $b_i \le 1$
follows from the assumption: $\|b\|_\infty \le \|b\|_{1,\infty} \le 1$. Furthermore, the
contribution of the small coefficients $b_i < 1/m$ to the norm $\|b\|_1$ is at most 1, while
by the assumption $\|b\|_1 \ge K \log\log m \ge 2$. Hence we can ignore these small
coefficients by replacing $[m]$ with the subset corresponding to the coefficients
$b_i \ge 1/m$.

We decompose $[m]$ into disjoint subsets (which we call blocks) according to
the magnitude of $b_i$, and we consider the contribution of each block $\Omega_j$ to the
norm $\|b\|_1$:

$\Omega_j := \{ i \in [m] : 2^{-j} < b_i \le 2^{-j+1} \}; \qquad m_j := |\Omega_j|; \qquad B_j := \sum_{i \in \Omega_j} b_i.$

By our assumptions on $b$, there are at most $\log m$ nonempty blocks $\Omega_j$. As
$\|b\|_{1,\infty} \le 1$, Markov's inequality yields for all $j$ that

(3.5)    $m_j \le \sum_{k \le j} m_k = |\{ i \in [m] : b_i > 2^{-j} \}| \le 2^j;$

(3.6)    $B_j \le m_j 2^{-j+1} \le 2.$

Only the blocks with large contributions $B_j$ will be of interest to us. Their
number is

$l := \max\{ j \in [\log m] : B_j^* \ge K/j \};$

and we let $l = 0$ if it happens that all $B_j^* < K/j$. We claim that there are
many such blocks:

(3.7)    $\frac{1}{5} K \log\log m \le l \le \log m.$

Indeed, by the assumption and using (3.6) we can bound

$K \log\log m \le \|b\|_1 = \sum_{j=1}^{\log m} B_j^* \le 2l + 0.6 \cdot K \log\log m,$

which yields (3.7).
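The definitions of Step 1 are easy to experiment with numerically. The following Python sketch is only an illustration: the function names are hypothetical, the weak $\ell_1$ norm is taken to be $\max_k k\, b_k^*$ for the non-increasing rearrangement $b^*$ (an assumption about the normalization), and the harmonic test vector used below does not satisfy the largeness assumption on $\|b\|_1$; it only exercises (3.5), (3.6) and the definition of $l$.

```python
import math

def weak_l1_norm(b):
    """max_k k * b*_k, where b* is the non-increasing rearrangement of |b_i|."""
    s = sorted((abs(x) for x in b), reverse=True)
    return max((k + 1) * v for k, v in enumerate(s))

def dyadic_blocks(b):
    """Map j -> list of indices i with 2^-j < b_i <= 2^(-j+1); requires 0 < b_i <= 1."""
    blocks = {}
    for i, bi in enumerate(b):
        j = math.floor(-math.log2(bi)) + 1
        blocks.setdefault(j, []).append(i)
    return blocks

def count_large_blocks(B, K):
    """l := max{ j : B*_j >= K / j } for the non-increasing rearrangement B*, or 0."""
    ordered = sorted(B, reverse=True)
    large = [j for j, Bj in enumerate(ordered, start=1) if Bj >= K / j]
    return max(large) if large else 0
```

For $b_i = 1/i$ the weak $\ell_1$ norm equals 1, block $j$ consists of the indices $2^{j-1} \le i < 2^j$, and both bounds $m_j \le 2^j$ and $B_j \le 2$ can be checked directly.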

Step 2: construction of the set $I_1$. As we said before, we are only
interested in blocks $\Omega_j$ with large contributions $B_j$. We collect the indices of
such blocks into the set

$J := \{ j \in [\log m] : B_j \ge K/l \}.$

Since the definition of $l$ implies that $B_l^* \ge K/l$, we have $|J| \ge l$. Then we can
apply Regularization Lemma 3.2 to the set $\{\log m - j : j \in J\} \subseteq [\log m]$.
Thus we find two elements $j', j'' \in J$ satisfying

(3.8)    $l/2 \le \log m - j' \le \log m - j'' \le (1 + \alpha/2)(\log m - j'),$

and such that the set

$J' := J \cap [j'', j']$

has size $|J'| \gtrsim_\alpha l/\log\log m$. Since by our choice of $K$ we can assume that
$K \ge 8 \log\log m$, we obtain

(3.9)    $|J'| \ge \frac{8l}{K}.$
We are going to show that the set

$I_1 := \bigcup_{j \in J'} \Omega_j$

satisfies the conclusion of the Proposition.

Step 3: sizes of the coefficients $b_i$ for $i \in I_1$. Let us fix $j \in J' \subseteq J$.
From the definition of $J$ we know that the contribution $B_j$ is large: $B_j \ge K/l$.
One consequence of this is a good estimate of the size $m_j$ of the block $\Omega_j$.
Indeed, the above bound together with (3.6) implies

(3.10)    $\frac{K}{2l} 2^j \le m_j \le 2^j$ for $j \in J$.

Another consequence of the lower bound on $B_j$ is the required lower bound on
the individual coefficients $b_i$. Indeed, by construction of $\Omega_j$ the coefficients $b_i$,
$i \in \Omega_j$ are within the factor 2 from each other. It follows that

(3.11)    $b_i \ge \frac{B_j}{2 m_j} \ge \frac{K}{2 l m_j}$ for $i \in \Omega_j$.

In particular, since by construction $\Omega_j \subseteq I_1$, we have $m_j \le n_1$, which implies

$b_i \ge \frac{K}{2 l n_1}$ for $i \in I_1$.

We have thus proved the required lower bound (3.3).

Step 4: Construction of the set $I_2$, and sizes of the coefficients $b_i$
for $i \in I_2$. Now suppose we are given a vector $\lambda = (\lambda_i)_{i \in I_1}$ with $\|\lambda\|_1 \le 1$.
We will have to construct a subset $I_2 \subseteq I_1$ as in the conclusion, and we will do
this as follows. Consider the contribution of the block $\Omega_j$ to the norm $\|\lambda\|_1$:

$L_j := \sum_{i \in \Omega_j} \lambda_i, \qquad j \in J'.$

On the one hand, the sum of all contributions is bounded as
$\sum_{j \in J'} L_j = \|\lambda\|_1 \le 1$. On the other hand, there are many terms in this sum:
$|J'| \ge 8l/K$ as we know from (3.9). Therefore, by the pigeonhole principle some of
the contributions must be small: there exists $j_0 \in J'$ such that

$L_{j_0} \le K/8l.$

This in turn implies via Markov's inequality that most of the coefficients $\lambda_i$
for $i \in \Omega_{j_0}$ are small, and we shall declare this set of indices $I_2$. Specifically,
since $L_{j_0} = \sum_{i \in \Omega_{j_0}} \lambda_i \le K/8l$ and $|\Omega_{j_0}| = m_{j_0}$, using Markov's inequality we
see that the set

$I_2 := \left\{ i \in \Omega_{j_0} : \lambda_i \le \frac{K}{4 l m_{j_0}} \right\}$

has cardinality

(3.12)    $\frac{1}{2} m_{j_0} \le |I_2| \le m_{j_0}.$

Moreover, using (3.11), we obtain

$b_i \ge \frac{K}{2 l m_{j_0}} \ge 2\lambda_i$ for $i \in I_2$.

We have thus proved the required lower bound (3.4).
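The two elementary selection steps of Step 4 (pigeonhole over blocks, then Markov inside the chosen block) can be sketched in Python as follows. The function name and data layout are hypothetical, and the threshold is taken to be twice the actual average of $\lambda_i$ over the chosen block rather than the bound $K/(4 l m_{j_0})$ used in the text; this is a simplification for illustration.

```python
def select_dominated_subset(blocks, lam):
    """blocks: dict block_id -> list of indices; lam: list of nonnegative weights.

    Returns (j0, I2): the block of least lambda-mass, and those of its indices
    whose weight is at most twice the block average.
    """
    # Pigeonhole: the block with the least mass has mass <= (total mass) / (#blocks).
    j0 = min(blocks, key=lambda j: sum(lam[i] for i in blocks[j]))
    mass = sum(lam[i] for i in blocks[j0])
    threshold = 2 * mass / len(blocks[j0])
    # Markov: fewer than half of the indices can exceed twice the average.
    I2 = [i for i in blocks[j0] if lam[i] <= threshold]
    return j0, I2
```

By Markov's inequality the returned set always contains at least half of the chosen block, which mirrors the lower bound in (3.12).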

Step 5: the sizes of the sets $I_1$ and $I_2$. It remains to check the regularity
property (i) of the conclusion of the Proposition. We bound

$|I_1| = \sum_{j \in J'} m_j$    (by definition of $I_1$)
$\le \sum_{j \le j'} m_j$    (by definition of $J'$)
$\le 2^{j'}.$    (by (3.5))

Therefore, using (3.8) we conclude that

(3.13)    $\frac{m}{|I_1|} \ge 2^{\log m - j'} \ge 2^{l/2}.$

We have thus proved the first inequality in (i) of the conclusion of the
Proposition. Similarly, we bound

$|I_2| \ge \frac{1}{2} m_{j_0}$    (by (3.12))
$\ge \frac{K}{4l} 2^{j_0}$    (by (3.10), and since $j_0 \in J'$)
$\ge \frac{K}{4l} 2^{j''}.$    (by definition of $J'$, and since $j_0 \in J'$)

Therefore

$\frac{m}{|I_2|} \le \frac{4l}{K} 2^{\log m - j''}$
$\le \frac{8}{K} (\log m - j')\, 2^{(1+\alpha/2)(\log m - j')}$    (by (3.8))
$\le (\log m - j')\, 2^{(1+\alpha/2)(\log m - j')}$    (by the assumption on $K$)
$\le 2^{(1+\alpha)(\log m - j')}$
$\le \left(\frac{m}{|I_1|}\right)^{1+\alpha}.$    (by (3.13))

This completes the proof of (i) of the conclusion, and of the whole
Proposition 3.1. □

4. Decoupling

In this section we develop a decoupling principle, which was informally
introduced in Section 2.2. In contrast to other decoupling results used in
probabilistic contexts, our decoupling principle is non-random. It is valid for arbitrary
fixed vectors $X_i$ which are almost pairwise orthogonal as in (2.5). An example
of such vectors are random vectors, as we observed earlier in Lemma 2.4.
Thus in this section we will consider vectors $X_1, \ldots, X_m \in \mathbb{R}^n$ that satisfy the
following almost pairwise orthogonality assumptions for some $r' \ge 1$, $K_1$, $K_2$:

(4.1)    $\|X_i\|_2 \le K_1 \sqrt{n};$
         $\frac{1}{|E|} \sum_{i \in E,\, i \ne k} \langle X_i, X_k \rangle^2 \le K_2^4 \left(\frac{N}{|E|}\right)^{1/r'} n$ for all $E \subseteq [m]$, $k \in [m]$.

In the earlier work [30] we developed a weaker decoupling principle, which
was valid for the weak $\ell_2$ norm instead of the $\ell_2$ norm. Let us recall this result
first. Assume that for vectors $X_i$ satisfying (4.1) with $r' = r$ one has

$\sup_{x \in S^{n-1}} \|(\langle X_i, x \rangle)_{i=1}^m\|_{2,\infty}^2 \gtrsim_{r, K_1, K_2} n + \left(\frac{N}{m}\right)^{1/r} m.$

Then the Decoupling Proposition 2.1 of [30] implies that there exist disjoint
sets of indices $I, J \subseteq [m]$ such that $|J| \le \delta |I|$, and there exists a vector
$y \in S^{n-1} \cap \mathrm{span}(X_j)_{j \in J}$, such that

$\langle X_i, y \rangle^2 \ge \left(\frac{N}{|I|}\right)^{1/r}$ for $i \in I$.

Results of this type are best suited for applications to random independent
vectors $X_i$. Indeed, the events that $\langle X_i, y \rangle^2$ is large are independent for $i \in I$
because $y$ does not depend on $(X_i)_{i \in I}$. The probability of each such event is
easy to bound using the moment assumptions (2.2).

In our new decoupling principle, we replace the weak $\ell_2$ norm by the $\ell_2$ norm
at the cost of an iterated logarithmic factor and a slight loss of the exponent.
Our result will thus operate in the regime where the weak $\ell_2$ norm is small
while the $\ell_2$ norm is large. We summarize this in the following proposition.

Proposition 4.1 (Decoupling). Let $n \ge 1$ and $4 \le m \le N$ be integers, and let
$1 \le r < \min(r', r'')$ and $\delta \in (0,1)$. Consider vectors $X_1, \ldots, X_m \in \mathbb{R}^n$ which
satisfy the weak orthonormality conditions (4.1) for some $K_1, K_2$. Assume that
for some $K_3 \ge \max(K_1, K_2)$ one has

(4.2)    $\sup_{x \in S^{n-1}} \|(\langle X_i, x \rangle)_{i=1}^m\|_{2,\infty}^2 \le K_3^2 \left[ n + \left(\frac{N}{m}\right)^{1/r} m \right],$

(4.3)    $\sup_{x \in S^{n-1}} \sum_{i=1}^m \langle X_i, x \rangle^2 \gtrsim_{r,r',r'',\delta} K_3^2 (\log\log m)^2 \left[ n + \left(\frac{N}{m}\right)^{1/r} m \right].$

Then there exist nonempty disjoint sets of indices $I, J \subseteq [m]$ such that
$|J| \le \delta |I|$, and there exists a vector $y \in S^{n-1} \cap \mathrm{span}(X_j)_{j \in J}$, such that

$\langle X_i, y \rangle^2 \ge K_3^2 \left(\frac{N}{|I|}\right)^{1/r''}$ for $i \in I$.



The proof of the Decoupling Proposition 4.1 will use Proposition 3.1 in
order to locate the structure of the large coefficients $\langle X_i, x \rangle$. The following
elementary lemma will be used in the argument.

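Lemma 4.2. Consider a vector $\lambda = (\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^n$ which satisfies

$\|\lambda\|_1 \le 1, \qquad \|\lambda\|_\infty \le 1/K$

for some integer $K$. Then, for every real numbers $(a_1, \ldots, a_n) \in \mathbb{R}^n$ one has

$\sum_{i=1}^n \lambda_i a_i \le \frac{1}{K} \sum_{j=1}^K a_j^*.$

Proof. It is easy to check that each extreme point of the convex set

$\Lambda := \{ \lambda \in \mathbb{R}^n : \|\lambda\|_1 \le 1,\ \|\lambda\|_\infty \le 1/K \}$

has exactly $K$ nonzero coefficients which are equal to $\pm 1/K$. Evaluating the
linear form $\sum \lambda_i a_i$ on these extreme points, we obtain

$\sup_{\lambda \in \Lambda} \sum_{i=1}^n \lambda_i a_i = \sup_{\lambda \in \mathrm{ext}(\Lambda)} \sum_{i=1}^n \lambda_i a_i = \frac{1}{K} \sum_{j=1}^K a_j^*.$

The proof is complete. □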
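Lemma 4.2 can be checked by brute force in small dimensions. The sketch below assumes that $(a_j^*)$ denotes the non-increasing rearrangement of $(|a_i|)$ and that $K \le n$; the function names are hypothetical. It enumerates the extreme points described in the proof (exactly $K$ nonzero coordinates equal to $\pm 1/K$) and compares the supremum of the linear form with the average of the $K$ largest $|a_i|$.

```python
import itertools

def top_k_average(a, K):
    """(1/K) * (sum of the K largest values of |a_i|)."""
    return sum(sorted((abs(x) for x in a), reverse=True)[:K]) / K

def sup_over_extreme_points(a, K):
    """Maximize sum_i lambda_i a_i over vectors with K nonzero entries equal to +-1/K."""
    best = float("-inf")
    for support in itertools.combinations(range(len(a)), K):
        for signs in itertools.product((1, -1), repeat=K):
            value = sum(s * a[i] for s, i in zip(signs, support)) / K
            best = max(best, value)
    return best
```

The enumeration is exponential in $K$ and is meant only as a sanity check of the statement, not as a practical way to evaluate the bound.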



Proof of Decoupling Proposition 4.1. By replacing $X_i$ with $X_i/K_3$ we can
assume without loss of generality that $K_1 = K_2 = K_3 = 1$. By perturbing the
vectors $X_i$ slightly we may also assume that $X_i$ are all different.

Step 1: separation and the structure of coefficients. Suppose the
assumptions of the Proposition hold, and let us choose a vector $x \in S^{n-1}$
which attains the supremum in (4.3). We denote

$a_i := \langle X_i, x \rangle$ for $i \in [m]$,

and without loss of generality we may assume that $a_i \ne 0$. We also denote

$\bar{n} := n + \left(\frac{N}{m}\right)^{1/r} m.$

We choose parameter $\alpha = \alpha(r, r', r'', \delta) \in (0,1)$ sufficiently small; its choice
will become clear later on in the argument. At this point, we may assume that

$\|a\|_{2,\infty}^2 \le \bar{n}, \qquad \|a\|_2^2 \gtrsim_\alpha (\log\log m)^2\, \bar{n}.$

We can use Structure Proposition 3.1 to locate the structure in the coefficients
$a_i$. To this end, we apply this result for $b_i = a_i^2/\bar{n}$ and obtain a number
$l \le \log m$ and a subset of indices $I_1 \subseteq [m]$. We can also assume that $l$ is
sufficiently large - larger than an arbitrary quantity which depends on $\alpha$.

Since the vector $x \in S^{n-1}$ satisfies $\langle X_i/a_i, x \rangle = 1$ for all $i \in I_1$ (in fact for
all $i \in [m]$), a separation argument for the convex hull $K := \mathrm{conv}(X_i/a_i)_{i \in I_1}$
yields the existence of a vector $\bar{x} \in \mathrm{conv}(K \cup 0)$ that satisfies

(4.4)    $\|\bar{x}\|_2 = 1, \qquad \langle X_i/a_i, \bar{x} \rangle \ge 1$ for $i \in I_1$.

We express $\bar{x}$ as a convex combination

(4.5)    $\bar{x} = \sum_{i \in I_1} \lambda_i X_i/a_i$ for some $\lambda_i \ge 0$, $\sum_{i \in I_1} \lambda_i \le 1$.

We then read the conclusion of Structure Proposition 3.1 as follows. There
exists a further subset of indices $I_2 \subseteq I_1$ such that the sizes $n_1 := |I_1|$ and
$n_2 := |I_2|$ are regular in the sense that

(4.6)    $2^{l/2} \le \frac{m}{n_1} \le \frac{m}{n_2} \le \left(\frac{m}{n_1}\right)^{1+\alpha},$

and the coefficients on $I_1$ and $I_2$ are large:

(4.7)    $a_i^2 \ge \frac{\bar{n}}{l n_1}$ for $i \in I_1$,

(4.8)    $a_i^2 \ge \frac{\bar{n}}{l n_2}$ and $a_i^2 \ge 2 \lambda_i \bar{n}$ for $i \in I_2$.

Furthermore, we can make $l$ sufficiently large depending on $\alpha$, say $l \ge 100/\alpha^2$.

Step 2: random selection. We will reduce the number of terms $n_1$ in
the sum (4.5) defining $\bar{x}$ using random selection, trying to bring this number
down to about $n_2$. As is usual in dealing with sums of independent random
variables, we will need to ensure that all summands $\lambda_i X_i/a_i$ have controlled
magnitudes. To this end, we have $\|X_i\|_2 \le \sqrt{n}$ by the assumption, and we can
bound $a_i$ through (4.7). Finally, we have an a priori bound $\lambda_i \le 1$ on the
coefficients of the convex combination. However, the latter bound will turn out
to be too weak, and we will need $\lambda_i \lesssim_\alpha 1/n_2$ instead. To make this happen,
instead of the sets $I_1$ and $I_2$ we will be working on their large subsets $I_1'$ and
$I_2'$ defined as

$I_1' := \left\{ i \in I_1 : \lambda_i \le \frac{C_\alpha}{n_2} \right\}, \qquad I_2' := \left\{ i \in I_2 : \lambda_i \le \frac{C_\alpha}{n_2} \right\}$

where $C_\alpha$ is a sufficiently large quantity whose value we will choose later. By
Markov's inequality, this incurs almost no loss of coefficients:

(4.9)    $|I_1 \setminus I_1'| \le \frac{n_2}{C_\alpha}, \qquad |I_2 \setminus I_2'| \le \frac{n_2}{C_\alpha}.$

We will perform a random selection on $I_1'$ using B. Maurey's empirical
method [21]. Guided by the representation (4.5) of $\bar{x}$ as a convex
combination, we will treat $\lambda_i$ as probabilities, thus introducing a random vector $V$
with distribution

$\mathbb{P}\{V = X_i/a_i\} = \lambda_i, \qquad i \in I_1'.$

On the remainder of the probability space, we assign $V$ zero value:
$\mathbb{P}\{V = 0\} = 1 - \sum_{i \in I_1'} \lambda_i$. Consider independent copies $V_1, V_2, \ldots$ of $V$. We are not
going to do a random selection on the set $I_1 \setminus I_1'$ where the coefficients $\lambda_i$ may
be out of control, so we just add its contribution by defining independent
random vectors

$Y_j := V_j + \sum_{i \in I_1 \setminus I_1'} \lambda_i X_i/a_i, \qquad j = 1, 2, \ldots$

Finally, for $C_\alpha' := C_\alpha/\alpha$, we consider the average of about $n_2$ such vectors:

(4.10)    $\bar{y} := \frac{C_\alpha'}{n_2} \sum_{j=1}^{n_2/C_\alpha'} Y_j.$

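We would like to think of $\bar{y}$ as a random version of the vector $\bar{x}$. This is
certainly true in expectation:

$\mathbb{E} \bar{y} = \mathbb{E} Y_1 = \sum_{i \in I_1} \lambda_i X_i/a_i = \bar{x}.$

Also, like $\bar{x}$, the random vector $\bar{y}$ is a convex combination of terms $X_i/a_i$ (now
even with equal weights). The advantage of $\bar{y}$ over $\bar{x}$ is that it is a convex
combination of much fewer terms, as $n_2/C_\alpha' \le n_2 \le n_1$. In the next two
steps, we will check that $\bar{y}$ is similar to $\bar{x}$ in the sense that its norm is also
well bounded above, and at least $\sim n_2$ of the inner products $\langle X_i/a_i, \bar{y} \rangle$ are still
nicely bounded below.

Step 3: control of the norm. By independence, we have

$\mathbb{E} \|\bar{y} - \bar{x}\|_2^2 = \mathbb{E} \left\| \frac{C_\alpha'}{n_2} \sum_{j=1}^{n_2/C_\alpha'} (Y_j - \mathbb{E} Y_j) \right\|_2^2$
$\le \frac{C_\alpha'}{n_2} \mathbb{E} \|Y_1 - \mathbb{E} Y_1\|_2^2 = \frac{C_\alpha'}{n_2} \mathbb{E} \|V - \mathbb{E} V\|_2^2 \le \frac{C_\alpha'}{n_2} \mathbb{E} \|V\|_2^2$
$= \frac{C_\alpha'}{n_2} \sum_{i \in I_1'} \lambda_i \|X_i\|_2^2 / a_i^2 \le \frac{C_\alpha'}{n_2} \max_{i \in I_1} \|X_i\|_2^2 / a_i^2,$

where the last inequality follows because $I_1' \subseteq I_1$ and $\sum_{i \in I_1} \lambda_i \le 1$.
Since $\bar{n} \ge n$, (4.7) gives us the lower bound

$a_i^2 \ge \frac{n}{l n_1}$ for $i \in I_1$.

Together with the assumption $\|X_i\|_2^2 \le n$, this implies that

$\mathbb{E} \|\bar{y} - \bar{x}\|_2^2 \le \frac{C_\alpha'}{n_2} \cdot \frac{n}{n/(l n_1)} = C_\alpha' \frac{l n_1}{n_2}.$

Since $\|\bar{x}\|_2^2 = 1 \le l n_1/n_2$, we conclude that with probability at least 0.9, one
has

(4.11)    $\|\bar{y}\|_2^2 \lesssim_\alpha \frac{l n_1}{n_2}.$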
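The random selection of Step 2 and the variance computation of Step 3 can be illustrated by a short simulation. This is a minimal sketch of the empirical method under simplifying assumptions: there is no uncontrolled part $I_1 \setminus I_1'$, the number of samples is called `k`, and the names `empirical_average` and `squared_distance` are hypothetical.

```python
import random

def empirical_average(points, weights, k, rng):
    """Average of k i.i.d. copies of V, where P{V = points[i]} = weights[i]
    and V = 0 with the remaining probability (weights sum to at most 1)."""
    dim = len(points[0])
    total = [0.0] * dim
    for _ in range(k):
        u = rng.random()
        acc = 0.0
        for p, w in zip(points, weights):
            acc += w
            if u < acc:                      # this draw selects the point p
                total = [t + c for t, c in zip(total, p)]
                break                        # otherwise V = 0 contributes nothing
    return [t / k for t in total]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))
```

As in Step 3, the mean squared distance between the empirical average and the convex combination $\sum_i \lambda_i v_i$ is at most $\max_i \|v_i\|_2^2 / k$, so a convex combination of many points is approximated by an equal-weight average of only $k$ of them.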

Step 4: removal of the diagonal term. We know from (4.4) that
$\langle X_k/a_k, \bar{x} \rangle \ge 1$ for many terms $X_k$. We would like to replace $\bar{x}$ by its
random version $\bar{y}$, establishing a lower bound $\langle X_k/a_k, \bar{y} \rangle \gtrsim 1$ for many terms $X_k$.
But at the same time, our main goal is decoupling, in which we would need
to make the random vector $\bar{y}$ independent of those terms $X_k$. To make this
possible, we will first remove from the sum (4.10) defining $\bar{y}$ the "diagonal"
term containing $X_k$, and we call the resulting vector $\bar{y}^{(k)}$.

To make this precise, let us fix $k \in I_2' \subseteq I_1' \subseteq I_1$. We consider independent
random vectors

$V_j^{(k)} := V_j \mathbf{1}_{\{V_j \ne X_k/a_k\}}, \qquad Y_j^{(k)} := V_j^{(k)} + \sum_{i \in I_1 \setminus I_1'} \lambda_i X_i/a_i, \qquad j = 1, 2, \ldots$

Note that

(4.12)    $\mathbb{P}\{Y_j^{(k)} \ne Y_j\} = \mathbb{P}\{V_j^{(k)} \ne V_j\} = \mathbb{P}\{V_j = X_k/a_k\} = \lambda_k.$

Similarly to the definition (4.10) of $\bar{y}$, we define

$\bar{y}^{(k)} := \frac{C_\alpha'}{n_2} \sum_{j=1}^{n_2/C_\alpha'} Y_j^{(k)}.$

Then

(4.13)    $\mathbb{E} \bar{y}^{(k)} = \mathbb{E} Y_1^{(k)} = \mathbb{E} Y_1 - \lambda_k X_k/a_k = \bar{x} - \lambda_k X_k/a_k.$

As we said before, we would like to show that the random variable

$Z_k := \langle X_k/a_k, \bar{y}^{(k)} \rangle$

is bounded below by a constant with high probability. First, we will estimate
its mean

$\mathbb{E} Z_k = \langle X_k/a_k, \bar{x} \rangle - \lambda_k \|X_k\|_2^2 / a_k^2.$

To estimate the terms in the right hand side, note that $\langle X_k/a_k, \bar{x} \rangle \ge 1$ by (4.4)
and $\|X_k\|_2^2 \le n$ by the assumption. Now is the crucial point when we use that
$a_k^2$ dominate $\lambda_k$ as in the second inequality in (4.8). This allows us to bound
the "diagonal" term as

$\lambda_k \|X_k\|_2^2 / a_k^2 \le n/2\bar{n} \le n/2n = 1/2.$

As a result, we have

(4.14)    $\mathbb{E} Z_k \ge 1 - 1/2 = 1/2.$

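Step 5: control of the inner products. We would need a stronger
statement than (4.14) - that $Z_k$ is bounded below not only in expectation
but also with high probability. We will get this immediately by Chebyshev's
inequality if we can upper bound the variance of $Z_k$. In a way similar to Step 3,
we estimate

$\mathrm{Var}\, Z_k = \mathbb{E}(Z_k - \mathbb{E} Z_k)^2 = \mathbb{E} \left\langle X_k/a_k,\ \frac{C_\alpha'}{n_2} \sum_{j=1}^{n_2/C_\alpha'} (Y_j^{(k)} - \mathbb{E} Y_j^{(k)}) \right\rangle^2$

(4.15)    $\le \frac{C_\alpha'}{n_2} \mathbb{E} \langle X_k/a_k, V_1^{(k)} \rangle^2 = \frac{C_\alpha'}{n_2} \sum_{i \in I_1',\, i \ne k} \lambda_i \langle X_k/a_k, X_i/a_i \rangle^2.$

Now we need to estimate the various terms in the right hand side of (4.15).
We start with the estimate on the inner products, collecting them into

$S := \sum_{i \in I_1',\, i \ne k} \lambda_i \langle X_k, X_i \rangle^2 = \sum_{i \in I_1'} \lambda_i p_{ki}$, where $p_{ki} = \langle X_k, X_i \rangle^2 \mathbf{1}_{\{k \ne i\}}$.

Recall that, by the construction of $\lambda_i$ and of $I_1' \subseteq I_1$, we have $\sum_{i \in I_1'} \lambda_i \le 1$ and
$\lambda_i \le C_\alpha/n_2$ for $i \in I_1'$. We use Lemma 4.2 on order statistics to obtain the
bound

$S \le \frac{C_\alpha}{n_2} \sum_{i=1}^{n_2/C_\alpha} (p_k)_i^* \le \frac{C_\alpha}{n_2} \max_{E \subseteq [m],\ |E| = n_2/C_\alpha} \sum_{i \in E,\, i \ne k} \langle X_k, X_i \rangle^2.$

Finally, we use our weak orthonormality assumption (4.1) to conclude that

$S \lesssim_\alpha \left(\frac{N}{n_2}\right)^{1/r'} n.$

To complete the bound on the variance of $Z_k$ in (4.15) it remains to obtain
some good lower bounds on $a_k$ and $a_i$. Since $k \in I_2' \subseteq I_2$, (4.8) yields

$a_k^2 \ge \frac{\bar{n}}{l n_2} \ge \frac{n}{l n_2}.$

Similarly we can bound the coefficients $a_i$ in (4.15): using (4.7) we have
$a_i^2 \ge \bar{n}/l n_1$ since $i \in I_1' \subseteq I_1$. But here we will not simply replace $\bar{n}$ by $n$, as we
shall try to use $a_i^2$ to offset the term $(N/n_2)^{1/r'}$ in the estimate on $S$. To this
end, we note that $\bar{n} \ge (N/m)^{1/r} m \ge (N/n_1)^{1/r} n_1$ because $m \ge n_1$. Therefore,
using the last inequality in (4.6) and that $N \ge m$, we have

(4.16)    $\frac{\bar{n}}{n_1} \ge \left(\frac{N}{n_1}\right)^{1/r} \ge \left(\frac{N}{n_2}\right)^{\frac{1}{(1+\alpha) r}}.$

Using this, we obtain a good lower bound

$a_i^2 \ge \frac{\bar{n}}{l n_1} \ge \frac{1}{l} \left(\frac{N}{n_2}\right)^{\frac{1}{(1+\alpha) r}}$ for $i \in I_1'$.

Combining the estimates on $S$, $a_k$ and $a_i$, we conclude our upper bound
(4.15) on the variance of $Z_k$ as follows:

$\mathrm{Var}\, Z_k \lesssim_\alpha \frac{1}{n_2} \cdot \frac{l n_2}{n} \cdot l \left(\frac{N}{n_2}\right)^{-\frac{1}{(1+\alpha) r}} \cdot \left(\frac{N}{n_2}\right)^{1/r'} n$
$\le l^2 \left(\frac{N}{n_2}\right)^{-\alpha}$    (by choosing $\alpha$ small enough depending on $r$, $r'$)
$\le l^2 \left(\frac{m}{n_1}\right)^{-\alpha} \le l^2\, 2^{-\alpha l/2}$    (by (4.6))
$\le \alpha/16.$    (since $l$ is large enough depending on $\alpha$)

Combining this with the lower bound (4.14) on the expectation, we conclude
by Chebyshev's inequality the desired estimate

(4.17)    $\mathbb{P}\left\{ Z_k = \langle X_k/a_k, \bar{y}^{(k)} \rangle \ge \frac{1}{4} \right\} \ge 1 - \alpha$ for $k \in I_2'$.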

Step 6: decoupling. We are nearing the completion of the proof. Let us
consider the good events

$\mathcal{E}_k := \left\{ \langle X_k/a_k, \bar{y} \rangle \ge \frac{1}{4} \text{ and } \bar{y} = \bar{y}^{(k)} \right\}$ for $k \in I_2'$.

To show that each $\mathcal{E}_k$ occurs with high probability, we note that by definition
of $\bar{y}$ and $\bar{y}^{(k)}$ one has

$\mathbb{P}\{\bar{y} \ne \bar{y}^{(k)}\} \le \mathbb{P}\{Y_j \ne Y_j^{(k)} \text{ for some } j \in [n_2/C_\alpha']\}$
$\le \sum_{j=1}^{n_2/C_\alpha'} \mathbb{P}\{Y_j \ne Y_j^{(k)}\} = \frac{n_2}{C_\alpha'} \cdot \lambda_k$    (by (4.12))
$\le \frac{n_2}{C_\alpha'} \cdot \frac{C_\alpha}{n_2}$    (by definition of $I_2'$)
$= \alpha.$    (as we chose $C_\alpha' = C_\alpha/\alpha$)

From this and using (4.17) we conclude that

$\mathbb{P}\{\mathcal{E}_k\} \ge 1 - \mathbb{P}\left\{ \langle X_k/a_k, \bar{y}^{(k)} \rangle < \frac{1}{4} \right\} - \mathbb{P}\{\bar{y} \ne \bar{y}^{(k)}\} \ge 1 - 2\alpha$ for $k \in I_2'$.

An application of Fubini theorem yields that with probability at least 0.9,
at least $(1 - 20\alpha)|I_2'|$ of the events $\mathcal{E}_k$ hold simultaneously. More accurately,
with probability at least 0.9 the following event occurs, which we denote by $\mathcal{E}$.
There exists a subset $I \subseteq I_2'$ of size $|I| \ge (1 - 20\alpha)|I_2'|$ such that $\mathcal{E}_k$ holds for
all $k \in I$. Note that using (4.9) and choosing $C_\alpha$ sufficiently large we have

(4.18)    $(1 - 21\alpha) n_2 \le |I| \le n_2.$

Recall that the norm bound (4.11) also holds with high probability 0.9.
Hence with probability at least 0.8, both $\mathcal{E}$ and this norm bound hold. Let us
fix a realization of our random variables for which this happens. Then, first of
all, by definition of $\mathcal{E}_k$ we have

(4.19)    $\langle X_k/a_k, \bar{y} \rangle \ge \frac{1}{4}$ for $k \in I$.

Next, we are going to observe that $\bar{y}$ lies in the span of few vectors $X_i$. Indeed,
by construction $\bar{y}^{(k)}$ lies in the span of the vectors $Y_j^{(k)}$ for $j \in [n_2/C_\alpha']$. Each
such $Y_j^{(k)}$ by construction lies in the span of the vectors $X_i$, $i \in I_1 \setminus I_1'$ and of
one vector $V_j^{(k)}$. Finally, each such vector $V_j^{(k)}$, again by construction, is either
equal zero or $V_j$, which in turn equals $X_{i_0}/a_{i_0}$ for some $i_0 \ne k$. Since $\mathcal{E}$ holds, we
have $\bar{y} = \bar{y}^{(k)}$ for all $k \in I$. This implies that there exists a subset $I_0 \subseteq [m]$
(consisting of the indices $i_0$ as above) with the following properties. Firstly,
$I_0$ does not contain any of indices $k \in I$; in other words $I_0$ is disjoint from $I$.
Secondly, this set is small: $|I_0| \le n_2/C_\alpha'$. Thirdly, $\bar{y}$ lies in the span of $X_i$,
$i \in I_0 \cup (I_1 \setminus I_1')$. We claim that this set of indices,

$J := I_0 \cup (I_1 \setminus I_1')$

satisfies the conclusion of the Proposition.

Since $I$ and $I_0$ are disjoint and $I \subseteq I_2' \subseteq I_1'$, it follows that $I$ and $J$ are
disjoint as required. Moreover, by (4.9) and by choosing $C_\alpha$, $C_\alpha'$ sufficiently
large we have

$|J| \le |I_0| + |I_1 \setminus I_1'| \le \frac{n_2}{C_\alpha'} + \frac{n_2}{C_\alpha} \le \alpha n_2.$

When we combine this with (4.18) and choose $\alpha$ sufficiently small depending
on $\delta$, we achieve

$|J| \le \delta |I|$

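as required. Finally, we claim that the normalized vector

$y := \frac{\bar{y}}{\|\bar{y}\|_2}$

satisfies the conclusion of the Proposition. Indeed, we already noted that
$y \in \mathrm{span}(X_j)_{j \in J}$, as required. Next, for each $k \in I \subseteq I_2' \subseteq I_2$ we have

$\langle X_k, y \rangle^2 \ge \frac{1}{16} \cdot \frac{a_k^2}{\|\bar{y}\|_2^2}$    (by (4.19))
$\gtrsim_\alpha \frac{\bar{n}}{l n_2} \cdot \frac{n_2}{l n_1} = \frac{\bar{n}}{l^2 n_1}$    (by (4.8) and (4.11))
$\ge \frac{1}{l^2} \left(\frac{N}{n_2}\right)^{\frac{1}{(1+\alpha) r}}.$    (by (4.16))

We can get rid of $l^2$ in this estimate using the bound

$\left(\frac{N}{n_2}\right)^{\frac{\alpha}{(1+\alpha) r}} \ge \left(\frac{m}{n_2}\right)^{\frac{\alpha}{(1+\alpha) r}} \ge 2^{\frac{\alpha l}{2(1+\alpha) r}}$    (by (4.6))
$\ge 2^{\alpha^2 l}$    (choosing $\alpha$ small enough depending on $r$)
$\ge l^2.$    (since $l$ is large enough depending on $\alpha$)

Therefore

$\langle X_k, y \rangle^2 \ge \left(\frac{N}{n_2}\right)^{\frac{1-\alpha}{(1+\alpha) r}} \ge \left(\frac{N}{|I|}\right)^{1/r''}$ for $k \in I$

where the last inequality follows by choosing $\alpha$ sufficiently small depending on
$r$, $r''$. This completes the proof of Decoupling Proposition 4.1. □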

5. Norms of random matrices with independent columns 

In this section we apply our decoupling principle, Proposition 4.1, to
estimate norms of random matrices with independent columns. As we said,
a simple truncation argument of J. Bourgain [6] reduces the approximation
problem for covariance matrices to bounding the norm of the random matrix
$\sum_{i \in E} X_i \otimes X_i$ uniformly over index sets $E$. The following result gives such an
estimate for random vectors $X_i$ with finite moment assumptions.

Theorem 5.1. Let $1 \le n \le N$ be integers, and let $4 < p < q$ and $t \ge 1$.
Consider independent random vectors $X_1, \ldots, X_N$ in $\mathbb{R}^n$
which satisfy the moment assumptions (2.2). Then with probability at least
$1 - Ct^{-0.9q}$, for every index set $E \subseteq [N]$, $|E| \ge 4$, one has

$\left\| \sum_{i \in E} X_i \otimes X_i \right\| \lesssim_{p,q,K,L} t^2 (\log\log |E|)^2 \left[ n + \left(\frac{N}{|E|}\right)^{4/p} |E| \right].$



We can state Theorem 5.1 in terms of random matrices with independent
columns.

Corollary 5.2. Let $1 \le n \le N$ be integers, and let $4 < p < q$ and $t \ge 1$.
Consider the $n \times N$ random matrix $A$ whose columns are independent random
vectors $X_1, \ldots, X_N$ in $\mathbb{R}^n$ which satisfy (2.2). Then with probability at least
$1 - Ct^{-0.9q}$ one has

$\|A\| \lesssim_{p,q,K,L} t \log\log N \cdot (\sqrt{n} + \sqrt{N}).$

Moreover, with the same probability all $n \times m$ submatrices $B$ of $A$
simultaneously satisfy the following for all $4 \le m \le N$:

$\|B\| \lesssim_{p,q,K,L} t \log\log m \cdot \left[ \sqrt{n} + \left(\frac{N}{m}\right)^{2/p} \sqrt{m} \right].$



Proof of Theorem 5.1. By replacing $X_i$ with $X_i/\max(K, L)$ we can assume
without loss of generality that $K = L = 1$. As we said, the argument will
be based on Decoupling Proposition 4.1. Its assumptions follow from known
results. Indeed, the pairwise almost orthogonality of the vectors $X_i$ follows
from Lemma 2.4, which yields (2.5) with probability at least $1 - Ct^{-q}$. Also,
the required bound on the weak $\ell_2$ norm follows from Theorem 2.3, which gives
with probability at least $1 - Ct^{-0.9q}$ that

(5.1)    $\sup_{x \in S^{n-1}} \|(\langle X_i, x \rangle)_{i \in I}\|_{2,\infty}^2 \lesssim_q t^2 \left[ n + \left(\frac{N}{|I|}\right)^{4/q} |I| \right]$ for $I \subseteq [N]$.



Consider the event $\mathcal{E}$ that both required bounds (2.5) and (5.1) hold.
Let $\mathcal{E}_0$ denote the event in the conclusion of the Theorem. It remains to
prove that $\mathbb{P}(\mathcal{E}_0^c \text{ and } \mathcal{E})$ is small. To this end, assume that $\mathcal{E}$ holds but $\mathcal{E}_0$ does
not. Then there exists an index set $E \subseteq [N]$ whose size we denote by $m := |E|$,
and which satisfies

$\left\| \sum_{i \in E} X_i \otimes X_i \right\| = \sup_{x \in S^{n-1}} \sum_{i \in E} \langle X_i, x \rangle^2 \gtrsim_{p,q} t^2 (\log\log m)^2 \left[ n + \left(\frac{N}{m}\right)^{4/p} m \right].$

Recalling (2.5) and (5.1) we see that the assumptions of Decoupling
Proposition 4.1 hold for $1/r = 4/p$, $1/r' = 4/q$, $r'' = r'$, $K_1 = K = 1$, $K_2 = C_q \sqrt{t}$
for suitably large $C_q$, $K_3 = \max(K_1, K_2, 100t^2)$, and for $\delta = \delta(p,q) > 0$
sufficiently small (to be chosen later). Applying Decoupling Proposition 4.1 we
obtain disjoint index sets $I, J \subseteq E \subseteq [N]$ with sizes

$|I| =: s, \qquad |J| \le \delta s,$

and a vector $y \in S^{n-1} \cap \mathrm{span}(X_j)_{j \in J}$ such that

(5.2)    $\langle X_i, y \rangle^2 \ge 100 t^2 \left(\frac{N}{s}\right)^{1/r''}$ for $i \in I$.

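We will need to discretize the set of possible vectors $y$. Let

$\varepsilon := \left(\frac{\delta s}{N}\right)^5$

and consider an $\varepsilon$-net $\mathcal{N}_J$ of the sphere $S^{n-1} \cap \mathrm{span}(X_j)_{j \in J}$. As is known by
a volumetric argument (see e.g. [19] Lemma 2.6), one can choose such a net
with cardinality

$|\mathcal{N}_J| \le (3/\varepsilon)^{|J|} \le \left(\frac{3N}{\delta s}\right)^{5 \delta s}.$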

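We can assume that the random set $\mathcal{N}_J$ depends only on the number $\varepsilon$, the
set $J$ and the random variables $(X_j)_{j \in J}$. Given a vector $y$ as we have found
above, we can approximate it with some vector $y_0 \in \mathcal{N}_J$ so that $\|y - y_0\|_2 \le \varepsilon$.
By (5.1) we have

$\|(\langle X_i, y - y_0 \rangle)_{i \in I}\|_{2,\infty}^2 \lesssim_q \varepsilon^2 t^2 \left[ n + \left(\frac{N}{s}\right)^{1/r'} s \right].$

This implies that all but at most $\delta s$ indices $i$ in $I$ satisfy the inequality

(5.3)    $\langle X_i, y - y_0 \rangle^2 \lesssim_q \frac{\varepsilon^2 t^2}{\delta s} \left[ n + \left(\frac{N}{s}\right)^{1/r'} s \right].$

Let us denote the set of these indices by $I_0 \subseteq I$. The bound in (5.3) can be
simplified as

$\frac{\varepsilon^2}{\delta s} \left[ n + \left(\frac{N}{s}\right)^{1/r'} s \right] \le 2\delta.$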

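Indeed, this estimate follows from the two bounds

$\frac{\varepsilon^2}{\delta s} \cdot n \le \left(\frac{\delta s}{N}\right)^{10} \cdot \frac{n}{\delta s} \le \delta$    (because $n \le N$);

$\frac{\varepsilon^2}{\delta s} \left(\frac{N}{s}\right)^{1/r'} s = \frac{1}{\delta} \left(\frac{\delta s}{N}\right)^{10} \left(\frac{N}{s}\right)^{1/r'} \le \delta.$

In particular, by choosing $\delta = \delta(q) > 0$ sufficiently small, (5.3) implies

$|\langle X_i, y - y_0 \rangle| \le t$ for $i \in I_0$.

Together with (5.2) this yields by triangle inequality that

$|\langle X_i, y_0 \rangle| \ge 10 t \left(\frac{N}{s}\right)^{1/2r''} - t \ge 9 t \left(\frac{N}{s}\right)^{1/2r''}$ for $i \in I_0$.

Summarizing, we have shown that the event $\{\mathcal{E}_0^c \text{ and } \mathcal{E}\}$ implies the following
event: there exists a number $s \le N$, disjoint index subsets $I_0, J \subseteq [N]$ with
sizes $|I_0| \ge (1 - \delta)s$, $|J| \le \delta s$, and a vector $y_0 \in \mathcal{N}_J$ such that

$|\langle X_i, y_0 \rangle| \ge 9 t \left(\frac{N}{s}\right)^{1/2r''}$ for $i \in I_0$.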



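It will now be easy to estimate the probability of this event. First of all, for
each fixed vector $y_0 \in S^{n-1}$ and each index $i$, the moment assumptions (2.2)
imply via Markov's inequality that

$\mathbb{P}\left\{ |\langle X_i, y_0 \rangle| \ge 9 t \left(\frac{N}{s}\right)^{1/2r''} \right\} \le \frac{1}{(9t)^q} \left(\frac{N}{s}\right)^{-q/2r''} \le \frac{1}{(9t)^q} \left(\frac{N}{s}\right)^{-2}$

where the last line follows from our choice of $q$ and $r''$. By independence, for
each fixed vector $y_0 \in S^{n-1}$ and a fixed index set $I_0 \subseteq [N]$ of size $|I_0| \ge (1 - \delta)s$
we have

$\mathbb{P}\left\{ |\langle X_i, y_0 \rangle| \ge 9 t \left(\frac{N}{s}\right)^{1/2r''} \text{ for } i \in I_0 \right\} \le \left[ \frac{1}{(9t)^q} \left(\frac{N}{s}\right)^{-2} \right]^{(1-\delta)s}$

(5.4)    $= (9t)^{-(1-\delta) s q} \left(\frac{N}{s}\right)^{-2(1-\delta)s}.$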

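Then we bound the probability of event $\{\mathcal{E}_0^c \text{ and } \mathcal{E}\}$ by taking the union bound
over all $s$, $I_0$, $J$ as above, conditioning on the random variables $(X_j)_{j \in J}$ (which
fixes the $\varepsilon$-net $\mathcal{N}_J$), taking the union bound over the choice of $y_0 \in \mathcal{N}_J$, and
finally evaluating the probability using (5.4). This way we obtain via
Stirling's approximation of the binomial coefficients that

$\mathbb{P}\{\mathcal{E}_0^c \text{ and } \mathcal{E}\} \le \sum_{s=1}^{N} \binom{N}{(1-\delta)s} \binom{N}{\delta s} \left(\frac{3N}{\delta s}\right)^{5\delta s} (9t)^{-(1-\delta) s q} \left(\frac{N}{s}\right)^{-2(1-\delta)s}$
$\lesssim t^{-0.9q}.$    (by choosing $\delta > 0$ small enough)

It follows that

$\mathbb{P}\{\mathcal{E}_0^c\} \le \mathbb{P}\{\mathcal{E}_0^c \text{ and } \mathcal{E}\} + \mathbb{P}\{\mathcal{E}^c\} \lesssim t^{-0.9q} + Ct^{-q} + Ct^{-0.9q} \lesssim t^{-0.9q}.$

This completes the proof of Theorem 5.1. □

6. Approximating covariance matrices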

In this final section, we deduce our main result on the approximation of 
covariance matrices for random vectors with finite moments. 

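Theorem 6.1. Consider independent random vectors $X_1, \ldots, X_N$ in $\mathbb{R}^n$,
$4 \le n \le N$, which satisfy moment assumptions (2.2) for some $q > 4$ and some $K$,
$L$. Then for every $\delta > 0$ with probability at least $1 - \delta$ one has

(6.1)    $\left\| \frac{1}{N} \sum_{i=1}^{N} X_i \otimes X_i - \mathbb{E} X_i \otimes X_i \right\| \lesssim_{q,K,L,\delta} (\log\log n)^2 \left(\frac{n}{N}\right)^{\frac{1}{2} - \frac{2}{q}}.$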

In our proof of Theorem 6.1 we can clearly assume that $K = L = 1$ in the
moment assumptions (2.2) by rescaling the vectors $X_i$. So in the rest of this
section we suppose $X_i$ are such random vectors.
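The quantity controlled by Theorem 6.1 is easy to observe in a small simulation. The sketch below is only an illustration under assumptions that are not part of the text: the coordinates are i.i.d. uniform on $[-\sqrt{3}, \sqrt{3}]$ (so the covariance is the identity and all moments are finite), the operator norm is approximated by power iteration from a fixed start vector, and all function names are hypothetical.

```python
import math
import random

def sample_covariance(xs):
    """(1/N) * sum_i x_i (x_i)^T for a list of N vectors in R^n."""
    n, N = len(xs[0]), len(xs)
    S = [[0.0] * n for _ in range(n)]
    for x in xs:
        for a in range(n):
            for b in range(n):
                S[a][b] += x[a] * x[b]
    return [[S[a][b] / N for b in range(n)] for a in range(n)]

def operator_norm(M, iters=200):
    """Power-iteration estimate (a lower bound) of the operator norm of symmetric M."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    norm = 0.0
    for _ in range(iters):
        w = [sum(M[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
    return norm

def covariance_error(n, N, rng):
    """Operator-norm distance between the sample covariance and the identity."""
    r = math.sqrt(3.0)  # uniform on [-r, r] has variance 1
    xs = [[rng.uniform(-r, r) for _ in range(n)] for _ in range(N)]
    S = sample_covariance(xs)
    D = [[S[a][b] - (1.0 if a == b else 0.0) for b in range(n)] for a in range(n)]
    return operator_norm(D)
```

With the dimension fixed, increasing the sample size $N$ should shrink the error, in line with the decay in $n/N$ on the right hand side of (6.1); the simulation does not attempt to verify the exponent.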

For a level $B > 0$ and a vector $x \in S^{n-1}$, we consider the (random) index
set of large coefficients

$E_B = E_B(x) := \{ i \in [N] : |\langle X_i, x \rangle| \ge B \}.$

Lemma 6.2 (Large coefficients). Let $t \ge 1$. With probability at least $1 - Ct^{-0.9q}$,
one has

$|E_B| \lesssim_q n/B^2 + N(t/B)^{q/2}$ for $B > 0$.

Proof. This estimate follows from Theorem 2.3. By definition of the set $E_B$
and the weak $\ell_2$ norm, we obtain with the required probability that

$B^2 |E_B| \le \|(\langle X_i, x \rangle)_{i \in E_B}\|_{2,\infty}^2 \lesssim_q n + t^2 \left(\frac{N}{|E_B|}\right)^{4/q} |E_B|.$

Solving for $|E_B|$ we obtain the bound as in the conclusion. □
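The first inequality in this proof is a deterministic fact about the weak $\ell_2$ norm, and it can be checked directly. The sketch below assumes the normalization $\|a\|_{2,\infty}^2 = \max_k k\, (a_k^*)^2$ for the non-increasing rearrangement $a^*$; the function names are hypothetical.

```python
def weak_l2_norm_sq(a):
    """max_k k * (a*_k)^2, where a* is the non-increasing rearrangement of |a_i|."""
    s = sorted((abs(x) for x in a), reverse=True)
    return max((k + 1) * v * v for k, v in enumerate(s))

def level_set_size(a, B):
    """Number of coordinates with |a_i| >= B, i.e. the size of the set E_B."""
    return sum(1 for x in a if abs(x) >= B)
```

If exactly $k$ coordinates have magnitude at least $B$, then $a_k^* \ge B$, so $B^2 k \le k (a_k^*)^2 \le \|a\|_{2,\infty}^2$, which is the inequality $B^2 |E_B| \le \|\cdot\|_{2,\infty}^2$ used above.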



Proof of Theorem 6.1. The truncation argument described in [30] in the
beginning of proof of Proposition 4.3 reduces the problem to estimating the
contribution to the sum of large coefficients. Denote

$E := \left\| \frac{1}{N} \sum_{i=1}^{N} X_i \otimes X_i - \mathbb{E} X_i \otimes X_i \right\| = \sup_{x \in S^{n-1}} \left| \frac{1}{N} \sum_{i=1}^{N} \langle X_i, x \rangle^2 - \mathbb{E} \langle X_i, x \rangle^2 \right|.$



The truncation argument yields that for every $B \ge 1$, one has with probability
at least $1 - \delta/3$ that

$E \lesssim_q B \sqrt{\frac{n}{N}} + \sup_{x \in S^{n-1}} \frac{1}{N} \sum_{i \in E_B} \langle X_i, x \rangle^2 + \sup_{x \in S^{n-1}} \frac{1}{N}\, \mathbb{E} \sum_{i \in E_B} \langle X_i, x \rangle^2$

(6.2)    $=: I_1 + I_2 + I_3.$

We choose the value of the level

$B := \left(\frac{N}{n}\right)^{2/q}$

so that, using Lemma 6.2, with probability at least $1 - \delta/3$ we have

(6.3)    $|E_B| \lesssim_{q,\delta} n.$

It remains to estimate the right hand side of (6.2) using (6.3).
First, we clearly have

$I_1 = \left(\frac{n}{N}\right)^{\frac{1}{2} - \frac{2}{q}}.$

An estimate of $I_2$ follows from Theorem 5.1 for some $p = p(q) \in (4, q)$ to be
determined later. Note that enlarging $E_B$ can only make $I_2$ and $I_3$ larger.
So without loss of generality we can assume that $|E_B| \ge 4$ as required in
Theorem 5.1. This way, we obtain with probability at least $1 - \delta/3$ that

$I_2 \lesssim_{q,\delta} \frac{1}{N} (\log\log |E_B|)^2 \left[ n + \left(\frac{N}{|E_B|}\right)^{4/p} |E_B| \right]$
$\lesssim_{q,\delta} (\log\log n)^2 \left[ \frac{n}{N} + \left(\frac{n}{N}\right)^{1 - \frac{4}{p}} \right].$    (by (6.3))



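Finally, to estimate $I_3$ let us fix $x$ and consider the random variable
$Z_i = |\langle X_i, x \rangle|$. Since $\mathbb{E} Z_i^q \le 1$, an application of Hölder's and Markov's
inequalities yields

$\mathbb{E} Z_i^2 \mathbf{1}_{\{Z_i \ge B\}} \le (\mathbb{E} Z_i^q)^{2/q} (\mathbb{P}(Z_i \ge B))^{1 - 2/q} \le B^{2-q} \lesssim_{q,\delta} \left(\frac{n}{N}\right)^{2 - \frac{4}{q}}.$

Therefore

$I_3 = \sup_{x \in S^{n-1}} \frac{1}{N} \sum_{i=1}^{N} \mathbb{E} Z_i^2 \mathbf{1}_{\{Z_i \ge B\}} \lesssim_{q,\delta} \left(\frac{n}{N}\right)^{2 - \frac{4}{q}}.$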



Since we are free to choose $p = p(q)$ in the interval $(4, q)$, we choose the
middle of the interval, $p = (q + 4)/2$. Returning to (6.2) we conclude that

$E \lesssim_{q,\delta} (\log\log n)^2 \left(\frac{n}{N}\right)^{\frac{1}{2} - \frac{2}{q}}.$

This completes the proof of Theorem 6.1. □